We study semiparametric efficiency bounds and efficient estima-
tion of parameters defined through general moment restrictions with
missing data. Identification relies on auxiliary data containing infor- 
mation about the distribution of the missing variables conditional on 
proxy variables that are observed in both the primary and the aux- 
iliary database, when such distribution is common to the two data 
sets. The auxiliary sample can be independent of the primary sam- 
ple, or can be a subset of it. For both cases, we derive bounds when 
the probability of missing data given the proxy variables is unknown, 
or known, or belongs to a correctly specified parametric family. We 
find that the conditional probability is not ancillary when the two 
samples are independent. For all cases, we discuss efficient semipara- 
metric estimators. An estimator based on a conditional expectation 
projection is shown to require milder regularity conditions than one 
based on inverse probability weighting. 

1. Introduction. Many empirical studies are complicated by the presence 
of missing data. One solution to the identification problem is based on the 
assumption that information on the true value of the variables in the data 
set of interest (the primary data set) can be recovered using auxiliary data 
sources under a conditional independence assumption. The key element of 
this identification strategy is that the distribution of the variables of interest 
is assumed to be independent of whether they belong to the primary or the 
auxiliary sample, conditional on a set of proxy variables, which are observed 
in both samples. 
The first goal of this paper is to study semiparametric efficiency bounds
of parameters defined through general nonlinear and over-identified moment
conditions for missing data models under a conditional independence as- 
sumption. We provide semiparametric efficiency bounds for the cases when 
the propensity score is unknown, or is known, or belongs to a correctly spec- 
ified parametric family. In our context, the propensity score is defined as the 
probability that one observation belongs to the subsample where only the 
proxy variables are observed. The auxiliary sample can be either a subset 
of the primary sample ("verify-in-sample" case) or independent of the pri-
mary sample ("verify-out-of-sample"). The former case is a special case of
the MAR or CAR missing data structure where the missing variables are 
common to all subjects. Semiparametric efficiency bounds for this case are 
closely related to the results in Robins, Rotnitzky and Zhao (1994) when 
there is a single hierarchy in the case of monotone missing data patterns for 
a fixed set of instrument functions. [See also Robins and Rotnitzky (1995) 
and Chen and Breslow (2004).] We provide new results on semiparametric 
efficiency bounds for the "verify-out-of-sample" case. We find that while 
more information on the propensity score will not affect the asymptotic effi- 
ciency bounds for parameters defined in the verify-in-sample case [as shown 
in, e.g., Robins, Rotnitzky and Zhao (1994), Chen and Breslow (2004) and 
Hahn (1998)], it will improve the asymptotic efficiency for parameters de- 
fined in the verify-out-of-sample case. Our new efficiency bound results for 
the case when the parametric propensity is correctly specified should be 
useful in applied work because such an assumption is frequently adopted by 
empirical researchers. 

The second goal of the paper is to develop two classes of sieve-based Gen-
eralized Method of Moments (GMM) estimators that achieve the efficiency
bounds for parameters defined under either the "verify-out-of-sample" or the 
"verify-in-sample" framework. Each estimator relies only on one nonpara- 
metric estimate; a conditional expectation projection based GMM (hereafter 
CEP-GMM) estimator only requires the nonparametric estimation of a con- 
ditional expectation, while an inverse probability weighting based GMM 
(hereafter IPW-GMM) estimator only needs a nonparametric estimate of the 
propensity score. We establish asymptotic normality and efficiency proper- 
ties of both estimators under weaker regularity conditions than the existing 
ones in the literature. In particular, we allow for nonlinear and nonsmooth 
moment restrictions and for unbounded support of conditioning (or proxy) 
variables. The CEP-GMM estimator presents some advantages over the 
IPW-GMM estimator. First, its root-n asymptotic normality and efficiency 
can be derived without the strong assumption that the unknown propensity 
score is uniformly bounded away from zero and one. Second, the CEP-GMM 
estimator is characterized by a simple common format that achieves the rel- 
evant efficiency bound for all the cases we consider, regardless of whether the 
propensity score is unknown, or known or parametrically specified. In contrast,
the IPW-GMM estimator will be generally inefficient when the propensity
score is known, or is parametrically estimated using a correctly specified 
parametric model; in such instances, different combinations of nonparamet- 
ric and parametric estimates of the propensity score have to be specifically 
derived to achieve the efficiency bounds. 

Our results can also be applied to the estimation of parametric nonlinear 
models with nonclassical measurement errors with validation data, a topic 
that has been studied in Carroll and Wand (1991), Sepanski and Carroll
(1993), Carroll, Ruppert and Stefanski (1995), Lee and Sepanski (1995), 
Chen, Hong and Tamer (2005) among others. 

Section 2 describes the model and presents the semiparametric efficiency 
bounds. Semiparametrically efficient CEP-GMM and IPW-GMM estimators 
are developed in Sections 3 and 4, respectively. In Section 5 we illustrate 
empirically the performance of the different estimators in the estimation of 
the distribution of private consumption in rural India in the presence of 
missing data. Section 6 concludes. All proofs are given in the Appendices. 

2. Semiparametric efficiency bounds. Let $\{(X_i, Y_i, D_i)\}_{i=1}^{n}$ be an i.i.d.
sample from $(X, Y, D)$, and denote $Z_i = (Y_i, X_i)$, where $Y_i$ is only observed
when $D_i = 0$. We are interested in the estimation of parameters $\beta \in B$, a
compact subset of $\mathbb{R}^{d_\beta}$, defined implicitly in terms of general nonlinear
moment conditions. In the first (verify-out-of-sample) case such conditions
are described by

(1)    $E[m(Z;\beta_0) \mid D = 1] = 0$,

while in the second (verify-in-sample) case they are

(2)    $E[m(Z;\beta_0)] = 0$,

where $Z = (Y,X)$ and $m(\cdot;\beta)$ is a set of functions with dimension $d_m \ge d_\beta$.

In other words, under case (1) Y is always missing in the primary data set 
($D = 1$), which is a random sample from the population of interest, while
an independent auxiliary sample (where D = 0) will serve the purpose of 
ensuring the identification of parameters that would not be identified by the 
primary data set alone. Under case (2), the auxiliary sample is instead a 
subset of the entire primary sample. 

In this section we present the semiparametric efficiency bound for the 
estimation of $\beta$ implicitly defined by either moment conditions (1) or (2). In
this paper $\beta$ is typically used to denote an arbitrary value in the parameter
space, but to save notation in this section $\beta$ is also used as the true parameter
value $\beta_0$. Define

(3)    $\mathcal{E}(X;\beta) = E[m(Z;\beta) \mid X]$

to be the conditional expectation of the moment conditions given $X$, and
define

(4)    $V(m(Z;\beta) \mid X) = E[m(Z;\beta)m(Z;\beta)' \mid X] - \mathcal{E}(X;\beta)\mathcal{E}(X;\beta)'$

to be the conditional variance of the moment conditions given X. In addition, 
define 

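$p = \Pr(D = 1)$ and $p(X) = \Pr(D = 1 \mid X)$,
$J_\beta^1 = \frac{\partial}{\partial\beta} E[m(Z;\beta) \mid D = 1]$ and $J_\beta^2 = \frac{\partial}{\partial\beta} E[m(Z;\beta)]$.

Assumption 1. (i) Both $J_\beta^1$ and $J_\beta^2$ have full column rank equal to $d_\beta$;
(ii) the data $\{(X_i, Y_i, D_i)\}_{i=1}^{n}$ are an i.i.d. sample from $(X, Y, D)$;
(iii) $p \in (0,1)$, $p(X) \in (0,1)$.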

Notice that in both cases (1) and (2) the moment conditions are assumed 
to hold in the primary sample in which some information is missing. Identifi-
cation is possible because of the access to an auxiliary data set ($D = 0$) which
contains both $Y$ and a set of proxy variables $X$ that are also potentially of
interest, if the following fundamental conditional independence assumption 
holds: 

Assumption 2. $Y \perp D \mid X$.

Conditional independence assumptions have been used extensively in econo- 
metrics and statistics to achieve identification with missing data. Examples 
include inference in models with attrition or nonresponse [e.g., Little and Rubin
(2002), Robins and Rotnitzky (1995), Rotnitzky and Robins (1995), Wooldridge 
(2002), Wooldridge (2003)], the estimation of treatment effects [see e.g., the 
references surveyed in Heckman, LaLonde and Smith (1999)], the recovery 
of comparability over time of statistics calculated using data collected with 
different methodology [e.g., Clogg, Rubin, Schenker, Schultz and Weidman 
(1991), Schenker (2003), Tarozzi (2007)]. 

Under case (2), Assumption 2 would be satisfied if, for instance, the
probability of validating a given observation only depends on X. In case 
(1), Assumption 2 requires that the sampling scheme used to create the 
auxiliary sample depends only on X. If a simple random subset of the 
primary data is validated, $p(X)$ is a constant and the auxiliary data set
is characterized by the same distribution of $(Y,X)$ as the primary data
set, and Assumption 2 is easily seen to be satisfied. In this case, which is com-
mon in the statistics literature, the auxiliary data set is usually called a
validation data set. A stratified sample satisfying Assumption 2 in model 
(2) can also be produced through a two-stage sampling design using a fi-
nite number of strata [see e.g., Breslow, Robins and Wellner (2000) and
Breslow, McNeney and Wellner (2003)], in which case the only variable X 
that is observed for all sampled observations is a discrete stratum indi- 
cator. In the following, the "regular estimators" are defined according to 
Begun, Hall, Huang and Wellner (1983) and Ibragimov and Has'minskii (1981).



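Theorem 1. Let $\beta$ be defined by the moment conditions (1) or (2).
Under Assumptions 1-2, the asymptotic variance lower bound for all regular
estimators of $\beta$ is

$(J_\beta' \Omega_\beta^{-1} J_\beta)^{-1}$ for some $J_\beta$ and some positive definite $\Omega_\beta$,

where, for the moment condition (1), $J_\beta = J_\beta^1$ and $\Omega_\beta = \Omega_\beta^1$:

$\Omega_\beta^1 = E\left[ \frac{p(X)^2}{p^2 (1 - p(X))} V(m(Z;\beta) \mid X) + \frac{p(X)}{p^2} \mathcal{E}(X;\beta)\mathcal{E}(X;\beta)' \right]$,

and for the moment condition (2), $J_\beta = J_\beta^2$ and $\Omega_\beta = \Omega_\beta^2$:

$\Omega_\beta^2 = E\left[ \frac{1}{1 - p(X)} V[m(Z;\beta) \mid X] + \mathcal{E}(X;\beta)\mathcal{E}(X;\beta)' \right]$.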

In Appendix A we present explicit expressions for the efficient score func- 
tions corresponding to the asymptotic variance lower bounds in Theorem 1 
as well as in the following Theorems 2 and 3. 

2.1. Information content of the propensity score. It is interesting to an-
alyze whether the knowledge of the propensity score $p(X)$ decreases the
semiparametric efficiency bounds for the parameters $\beta$.

Theorem 2. Let $\beta$ be defined by the moment conditions (1) or (2).
Under Assumptions 1-2, if $p(X)$ is known, then the asymptotic variance
lower bound for all regular estimators of $\beta$ is

$(J_\beta' \Omega_\beta^{-1} J_\beta)^{-1}$ for some $J_\beta$ and some positive definite $\Omega_\beta$,

where, for the moment condition (1), $J_\beta = J_\beta^1$ and $\Omega_\beta = \tilde\Omega_\beta^1$:

$\tilde\Omega_\beta^1 = E\left[ \frac{p(X)^2}{p^2 (1 - p(X))} V(m(Z;\beta) \mid X) + \frac{p(X)^2}{p^2} \mathcal{E}(X;\beta)\mathcal{E}(X;\beta)' \right]$,

and for the moment condition (2), $J_\beta = J_\beta^2$ and $\Omega_\beta = \Omega_\beta^2$ given in Theorem 1.
In other words, knowledge of $p(X)$ reduces the semiparametric efficient
variance bound for $\beta$ under the "verify-out-of-sample" case, but it does
not under the "verify-in-sample" case. The following argument provides an
intuition for this result. When (2) holds, $\beta$ is defined through the relation

$\int\!\!\int m(y, x; \beta) f(y \mid x)\, dy\, f(x)\, dx = 0$.

The propensity score $p(X)$ does not enter the definition of $\beta$, therefore its
knowledge should not affect the variance bound for $\beta$. However, the relation
that identifies $\beta$ when (1) holds clearly depends on $p(X)$:

$\int\!\!\int m(y, x; \beta) p(x) f(y \mid x)\, dy\, f(x)\, dx = 0$.

Remark 1. A special case of Theorem 2 is when $p(X)$ is a constant $p$. In
this case, the auxiliary sample is also called a validation sample and is drawn
randomly from the same population as the primary sample, so that $(Y, X) \perp D$
[Carroll and Wand (1991), Sepanski and Carroll (1993), Lee and Sepanski 
(1995)]. In such case it is then easy to see that the two efficiency bounds 
given in Theorem 2 become identical. 
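
The comparison between Theorems 1 and 2 and the claim in Remark 1 can be checked numerically. The sketch below is illustrative only: it assumes a scalar moment function and a discrete $X$, and the function name `efficiency_omegas` and the two-cell designs are hypothetical, not part of the paper. It simply evaluates the three expressions for $\Omega_\beta^1$ (unknown $p(X)$), $\tilde\Omega_\beta^1$ (known $p(X)$) and $\Omega_\beta^2$.

```python
def efficiency_omegas(cells):
    """Evaluate the Omega expressions of Theorems 1 and 2 for scalar m and discrete X.

    cells: list of (prob_x, p_x, cond_mean, cond_var), where prob_x = Pr(X = x),
    p_x = Pr(D = 1 | X = x), cond_mean = E[m | X = x], cond_var = V(m | X = x).
    Returns (omega1_unknown, omega1_known, omega2).
    """
    p = sum(w * px for w, px, _, _ in cells)  # p = Pr(D = 1)
    om1_unknown = om1_known = om2 = 0.0
    for w, px, e, v in cells:
        var_term = px ** 2 / (p ** 2 * (1.0 - px)) * v
        om1_unknown += w * (var_term + px / p ** 2 * e ** 2)      # Theorem 1, case (1)
        om1_known += w * (var_term + px ** 2 / p ** 2 * e ** 2)   # Theorem 2, case (1)
        om2 += w * (v / (1.0 - px) + e ** 2)                      # case (2)
    return om1_unknown, om1_known, om2


# Hypothetical design with a non-constant propensity score; the cell means
# satisfy moment condition (1): sum of prob_x * p_x * cond_mean equals zero.
varying = [(0.5, 0.2, 3.0, 1.0), (0.5, 0.6, -1.0, 1.0)]
om1_u, om1_k, _ = efficiency_omegas(varying)

# Hypothetical design with a constant propensity score (validation sample).
constant = [(0.5, 0.4, 1.0, 1.0), (0.5, 0.4, -1.0, 1.0)]
_, c_known1, c_omega2 = efficiency_omegas(constant)

print(om1_u, om1_k)        # known p(X) gives the smaller value
print(c_known1, c_omega2)  # the two expressions coincide when p(X) is constant
```

Because $p(X)^2 \le p(X)$, the second expression can never exceed the first, which is the sense in which knowing $p(X)$ helps under case (1).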

Another interesting question is what is the efficiency bound for the es-
timation of $\beta$ defined by moment condition (1) if the propensity score is
unknown but is assumed to belong to a correctly specified parametric fam-
ily, so that $p(X) = p(X;\gamma)$. Let $p_\gamma(X) = \partial p(X;\gamma)/\partial\gamma$, and define the score
function for $\gamma$ as $S_\gamma = S_\gamma(D,X) = \frac{D - p(X;\gamma)}{p(X;\gamma)(1 - p(X;\gamma))}\, p_\gamma(X)$.

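Theorem 3. Let $\beta$ be defined by the moment conditions (1). Under
Assumptions 1-2, if $p(X) = p(X;\gamma)$ and $E[S_\gamma(D,X)S_\gamma(D,X)']$ is positive
definite, then the asymptotic variance lower bound for all regular estimators
of $\beta$ is $(J_\beta' \Omega_\beta^{-1} J_\beta)^{-1}$, where $J_\beta = J_\beta^1$, $\tilde\Omega_\beta^1$ is given in Theorem 2 and

$\Omega_\beta = \tilde\Omega_\beta^1 + E\left[ \frac{\mathcal{E}(X;\beta) p_\gamma(X)'}{p} \right] \left( E[S_\gamma S_\gamma'] \right)^{-1} E\left[ \frac{p_\gamma(X) \mathcal{E}(X;\beta)'}{p} \right]$.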



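This variance bound is clearly larger than $\tilde\Omega_\beta^1$ stated in Theorem 2, but
it is smaller than the bound in Theorem 1. This latter result can be verified
noting first that the bound in Theorem 3 corresponds to the variance of the
following influence function:

$\mathrm{Proj}\left( \frac{D - p(X)}{p}\mathcal{E}(X;\beta) \,\middle|\, S_\gamma(D,X) \right) + \frac{p(X)}{p}\mathcal{E}(X;\beta) + \frac{(1-D)\,p(X)}{p\,(1 - p(X))}[m(Z;\beta) - \mathcal{E}(X;\beta)]$,

where we use $\mathrm{Proj}(Z_1 \mid Z_2)$ to denote the population least squares projection
of a random variable $Z_1$ onto the linear space spanned by $Z_2$. The conclusion
follows noting that the variance bound stated in Theorem 1 for moment
condition (1) is instead the variance of the following influence function

$\frac{D - p(X)}{p}\mathcal{E}(X;\beta) + \frac{p(X)}{p}\mathcal{E}(X;\beta) + \frac{(1-D)\,p(X)}{p\,(1 - p(X))}[m(Z;\beta) - \mathcal{E}(X;\beta)]$,

whose corresponding variance is larger.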

Our results for GMM models complement and extend the finding in the 
program evaluation literature that knowing the propensity score decreases 
the efficient variance bound for the estimation of the average effect of treat- 
ment on the treated, while the propensity score is ancillary for the average 
treatment effect parameter [Hahn (1998)]. 

3. CEP-GMM estimation. In this section, we consider a first class of 
semiparametrically efficient estimators based on a conditional expectation 
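projection (CEP) method. If Assumption 2 holds, identification follows by
noting that, under case (1),

$E[m(Z;\beta) \mid D = 1] = \int E[m(Z;\beta) \mid x, D = 0]\, f(x \mid D = 1)\, dx$,

while under case (2)

$E[m(Z;\beta)] = \int E[m(Z;\beta) \mid x, D = 0]\, f(x)\, dx$.

Therefore, estimation of the parameters of interest can proceed by first es-
timating $E[m(Z;\beta) \mid x, D = 0]$ nonparametrically from the auxiliary sample,
and then integrating the conditional expectation against the distribution of
$X$ in the primary sample.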
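
The two-step logic just described can be sketched in a few lines for the simplest possible setting. The example below is an illustration under stated assumptions, not the paper's estimator: $X$ is discrete (so the nonparametric first step reduces to cell means), the moment function is $m(Z;\beta) = Y - \beta$ under case (1), and the data generating process and the name `cep_mean` are hypothetical.

```python
import random


def cep_mean(primary_x, aux_x, aux_y):
    """CEP estimate of E[Y | D = 1]: average auxiliary cell means of Y over primary X."""
    sums, counts = {}, {}
    for x, y in zip(aux_x, aux_y):
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    # Step 1: estimate E[Y | X = x, D = 0] from the auxiliary sample.
    cell_mean = {x: sums[x] / counts[x] for x in sums}
    # Step 2: integrate against the empirical distribution of X in the primary sample.
    return sum(cell_mean[x] for x in primary_x) / len(primary_x)


rng = random.Random(0)
primary_x, aux_x, aux_y = [], [], []
for _ in range(20000):
    x = rng.choice([0, 1, 2])
    y = 2.0 * x + rng.gauss(0.0, 1.0)             # Y depends on X only, so Y is independent of D given X
    d = 1 if rng.random() < 0.2 + 0.3 * x else 0  # p(X) = 0.2, 0.5, 0.8
    if d == 1:
        primary_x.append(x)                       # Y is not recorded in the primary sample
    else:
        aux_x.append(x)
        aux_y.append(y)

beta_cep = cep_mean(primary_x, aux_x, aux_y)
beta_naive = sum(aux_y) / len(aux_y)              # ignores that X is distributed differently when D = 0
print(beta_cep, beta_naive)                       # population values in this design: 2.8 and 1.2
```

The naive auxiliary-sample mean targets $E[Y \mid D = 0]$ rather than $E[Y \mid D = 1]$, which is why the reweighting by the primary-sample distribution of $X$ matters.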

3.1. Efficient estimation with unknown propensity score. In the follow-
ing, we use subscripts $p$ and $a$ to refer respectively to observations belonging
to the primary and to the auxiliary sample. Let $n_p$ be the size of the pri-
mary sample and $n_a$ be the size of the auxiliary sample. Observations in the
primary sample are indexed by $i = 1, \dots, n_p$. Observations in the auxiliary
sample are indexed by $j = 1, \dots, n_a$. Under moment condition (1) (verify-out-
of-sample case), $n = n_p + n_a$. Under moment condition (2) (verify-in-sample
case), $n = n_p$. Let $\hat{\mathcal{E}}(X;\beta)$ denote a nonparametric estimate of $\mathcal{E}(X;\beta)$
using the auxiliary sample. Chen, Hong and Tamer (2005) (hereafter CHT)
used a sieve Least Squares (LS) estimator. Let $\{q_l(X), l = 1, 2, \dots\}$ denote
a sequence of known basis functions that can approximate any square-
measurable function of $X$ arbitrarily well. Also let

$q^{k(n_a)}(X) = (q_1(X), \dots, q_{k(n_a)}(X))'$

and

$Q_a = (q^{k(n_a)}(X_{a1}), \dots, q^{k(n_a)}(X_{a n_a}))'$

for some integer $k(n_a)$, with $k(n_a) \to \infty$ and $k(n_a)/n \to 0$ when $n \to \infty$.
Then for each given $\beta$, the sieve LS estimator of $\mathcal{E}(X;\beta)$ is

$\hat{\mathcal{E}}(X;\beta) = \sum_{j=1}^{n_a} m(Z_{aj};\beta)\, q^{k(n_a)}(X_{aj})' (Q_a' Q_a)^{-1} q^{k(n_a)}(X)$.

A generalized method of moments estimator for $\beta_0$ can then be defined as

(5)    $\hat\beta = \arg\min_{\beta \in B} \left( \frac{1}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta) \right)' \hat W \left( \frac{1}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta) \right)$.
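
A minimal sketch of the sieve LS step followed by the CEP moment may help fix ideas. It makes several simplifying assumptions that are not in the paper: a power-series basis with a fixed number of terms `k`, a scalar just-identified moment $m(Z;\beta) = Y - \beta$ (so the weighting matrix plays no role and the minimizer of (5) is the primary-sample average of the fitted values), and a hypothetical simulated design. All function names are illustrative.

```python
import random


def poly_basis(x, k):
    """q^k(x) = (1, x, ..., x^(k-1)): a simple power-series sieve."""
    return [x ** j for j in range(k)]


def solve(a, b):
    """Solve the linear system a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        sol[i] = (m[i][n] - sum(m[i][j] * sol[j] for j in range(i + 1, n))) / m[i][i]
    return sol


def sieve_ls_fit(aux_x, aux_y, k):
    """Coefficients (Q'Q)^{-1} Q'y of the LS projection of y on q^k(x) in the auxiliary sample."""
    q = [poly_basis(x, k) for x in aux_x]
    qtq = [[sum(r[i] * r[j] for r in q) for j in range(k)] for i in range(k)]
    qty = [sum(r[i] * y for r, y in zip(q, aux_y)) for i in range(k)]
    return solve(qtq, qty)


def cep_gmm_mean(primary_x, aux_x, aux_y, k=4):
    """CEP estimate of beta for m(Z; beta) = Y - beta under case (1)."""
    coef = sieve_ls_fit(aux_x, aux_y, k)
    fitted = [sum(c * b for c, b in zip(coef, poly_basis(x, k))) for x in primary_x]
    return sum(fitted) / len(fitted)


rng = random.Random(1)
primary_x, aux_x, aux_y = [], [], []
for _ in range(6000):
    x = rng.uniform(-1.0, 1.0)
    y = 1.0 + x + x * x + rng.gauss(0.0, 0.5)
    if rng.random() < 0.5 + 0.4 * x:   # p(X) = 0.5 + 0.4 X
        primary_x.append(x)            # D = 1: Y is missing
    else:
        aux_x.append(x)                # D = 0: Y is observed
        aux_y.append(y)

beta_hat = cep_gmm_mean(primary_x, aux_x, aux_y)
print(beta_hat)  # the population value E[Y | D = 1] in this design is 1.6
```

For an over-identified or nonlinear moment function the closed form no longer applies and the quadratic form in (5) would have to be minimized numerically over $\beta$.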

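The $\sqrt{n}$-consistency and asymptotic normality of this CEP-GMM estima-
tor have been established in CHT. Following the proof of their claim (A.2),
we have the following asymptotic representation:

$\frac{\sqrt{n}}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta_0) = \frac{\sqrt{n}}{n_p}\sum_{i=1}^{n_p} \mathcal{E}(X_{pi};\beta_0) + \frac{\sqrt{n}}{n_a}\sum_{j=1}^{n_a} \frac{f_{X_p}(X_{aj})}{f(X_{aj} \mid D = 0)}[m(Z_{aj};\beta_0) - \mathcal{E}(X_{aj};\beta_0)] + o_p(1)$,

where we use $f_{X_p}(X)$ to denote the density of $X$ in the primary data set,
and $o_p(1)$ represents a term that converges to 0 in probability.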

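When moment condition (1) holds, $n = n_p + n_a$, $f_{X_p}(X) = f(X \mid D = 1)$
and

$\frac{f_{X_p}(X)}{f(X \mid D = 0)} = \frac{(1 - p)\,p(X)}{p\,(1 - p(X))}$.

In this case we can also write the influence function for
$\frac{\sqrt{n}}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta_0)$ as

(6)    $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ \frac{D_i}{p}\mathcal{E}(X_i;\beta_0) + (1 - D_i)\frac{p(X_i)}{p\,(1 - p(X_i))}[m(Z_i;\beta_0) - \mathcal{E}(X_i;\beta_0)] \right] + o_p(1)$.

The proof of Theorem 1 shows that the two terms in the influence function
correspond to the two components of the efficient influence function that
contain information about $f(X \mid D = 1)$ and $f(Y \mid X)$, respectively. These
two terms are orthogonal to each other, so that

$\frac{\sqrt{n}}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta_0) \Rightarrow \mathcal{N}(0, \Omega_\beta^1)$,

where $\Omega_\beta^1$ is given in Theorem 1.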

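When moment condition (2) holds, $f_{X_p}(X) = f(X)$, $n_p = n$ and

$\frac{f_{X_p}(X)}{f(X \mid D = 0)} = \frac{1 - p}{1 - p(X)}$.

The influence function for $\frac{\sqrt{n}}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta_0)$ can then be written as

(7)    $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ \mathcal{E}(X_i;\beta_0) + (1 - D_i)\frac{1}{1 - p(X_i)}[m(Z_i;\beta_0) - \mathcal{E}(X_i;\beta_0)] \right] + o_p(1)$.

The two terms in the influence function correspond to the two components
of the projected efficient influence function that contain information about
$f(X)$ and $f(Y \mid X)$ in the proof of Theorem 1. The orthogonality between
these two terms implies that

$\frac{\sqrt{n}}{n_p}\sum_{i=1}^{n_p} \hat{\mathcal{E}}(X_{pi};\beta_0) \Rightarrow \mathcal{N}(0, \Omega_\beta^2)$,

where $\Omega_\beta^2$ is given in Theorem 1. The semiparametric efficiency bounds given
in Theorem 1 are then achieved by an optimally weighted GMM estimator
$\hat\beta$ for $\beta_0$ that uses a weighting matrix $\hat W = \Omega_\beta^{-1} + o_p(1)$.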

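Theorem 4. Let $\hat\beta$ be the CEP-GMM estimator given in (5). Under
Assumptions 1-2, and Assumptions 3-5 of CHT, we have $\sqrt{n}(\hat\beta - \beta_0) \Rightarrow \mathcal{N}(0, V)$,
with $V = (J_\beta' W J_\beta)^{-1} J_\beta' W \Omega_\beta W J_\beta (J_\beta' W J_\beta)^{-1}$, where $\Omega_\beta$ is given
in Theorem 1. Furthermore, if $W = \Omega_\beta^{-1}$, then $\sqrt{n}(\hat\beta - \beta_0) \Rightarrow \mathcal{N}(0, V_0)$, with
$V_0 = (J_\beta' \Omega_\beta^{-1} J_\beta)^{-1}$, where $J_\beta = J_\beta^1$ and $\Omega_\beta = \Omega_\beta^1$ under moment condition
(1), and $J_\beta = J_\beta^2$ and $\Omega_\beta = \Omega_\beta^2$ under moment condition (2).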

CHT derive the root-n consistency and normality of the CEP-GMM es-
timator. Theorem 4 says that their estimator $\hat\beta$ is also semiparametrically
efficient. The proof of Theorem 4 follows directly from that of Theorem
2 in CHT, who also provide simple consistent estimators of $V$ and $V_0$.
In the working paper version of this article, we have stated Assumptions
3-5 of CHT in terms of the notation of this paper. These assumptions
are mild regularity conditions. In particular, they allow for (i) nonsmooth
$m(Z;\beta)$, such as quantile-based moment functions; (ii) unbounded support
of the conditioning (proxy) variable $X$; (iii) a propensity
score function $p(X)$ that is not uniformly bounded away from
zero and one. Recall that in the program evaluation literature such as in
Hirano, Imbens and Ridder (2003), the stronger condition $0 < \underline{p} \le p(x) \le \bar{p} < 1$
is typically imposed for root-n asymptotically normal and efficient
estimation of $\beta_0$.

3.2. CEP estimation with parametric or known propensity score. Sup-
pose now that the propensity score $p(X)$ is correctly parameterized as $p(X;\gamma)$
up to a finite-dimensional unknown parameter $\gamma$. Theorems 2 and 4 show
that the optimally weighted CEP-GMM estimator defined in (5) still achieves
the semiparametric efficiency bound for $\beta$ defined by moment condition (2).
However, according to Theorems 3 and 4, such an estimator is no longer
efficient for $\beta$ defined through moment condition (1).

Rewriting moment condition (1) as $E[\mathcal{E}(X_i;\beta_0)\, p(X_i)/p] = 0$, we can again
construct an efficient estimator for $\beta_0$ based on the sieve estimate $\hat{\mathcal{E}}(X;\beta)$
and the correctly specified parametric form $p(X;\gamma)$. In particular, the op-
timally weighted GMM estimator using the following sample moment con-
dition will achieve the efficiency bound in Theorem 3 for $\beta$ defined through
(1):

(8)    $\frac{1}{n}\sum_{j=1}^{n} \hat{\mathcal{E}}(X_j;\beta)\frac{p(X_j;\hat\gamma)}{\hat p}$,

where $\hat p = n_p/n$ and $\hat\gamma$ is the parametric MLE that solves the score equation
for $\gamma$: $\frac{1}{n}\sum_{i=1}^{n} S_{\hat\gamma}(D_i, X_i) = 0$.

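Theorem 5. Let $p(X;\gamma)$ be the parametric propensity score function
known up to the parameters $\gamma$ and let $E[S_{\gamma_0}(D,X)S_{\gamma_0}(D,X)']$ be positive
definite. Let $\beta_0$ satisfy the moment condition (1) and $\hat\beta$ be its CEP-GMM
estimator using the sample moment (8). Under Assumptions 1-2 and As-
sumptions 3-5 of CHT, we have $\sqrt{n}(\hat\beta - \beta_0) \Rightarrow \mathcal{N}(0, V)$, with

$V = (J_\beta^{1\prime} W J_\beta^1)^{-1} J_\beta^{1\prime} W \Omega_\beta W J_\beta^1 (J_\beta^{1\prime} W J_\beta^1)^{-1}$,

where $\Omega_\beta$ is given in Theorem 3. Further, if $W = \Omega_\beta^{-1}$, then $\sqrt{n}(\hat\beta - \beta_0) \Rightarrow \mathcal{N}(0, V_0)$,
where $V_0 = (J_\beta^{1\prime} \Omega_\beta^{-1} J_\beta^1)^{-1}$ is the efficiency variance bound given
in Theorem 3.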
The proof of this theorem is similar to that of Theorem 2 in CHT and is
thus omitted. The influence function representation of (8) is stated in the
working paper version of this article.

We remark that even when a parametric assumption is being made about
the propensity score $p(X;\gamma)$ [in fact even if in addition f(Y) is assumed to
be a parametric likelihood], the inference about $\beta$ is still semiparametric.
This is because the marginal density $f(X)$ is still nonparametric and con-
tains semiparametric information about $\beta$. This explains why nonparametric
estimation is still needed to achieve the bound for $\beta$.

The case where the propensity score is fully known can be considered
a special case of parametric propensity score where the parameters are
known. In this case, the efficient moment condition is as in (8) after re-
placing $p(X_j;\hat\gamma)$ with the known $p(X_j)$.

Remark 2. When the auxiliary data set is a validation data set, that
is, $p(X) = p$, the parameters $\beta$ defined by both moment conditions
(1) and (2) coincide. Therefore, the CEP-GMM estimator defined in (5),
when we take $n_p = n$ and the summation to be over all the observations, will
achieve semiparametric efficiency.

4. IPW-GMM estimation. An alternative estimation method for $\beta$ is
the inverse probability weighting based GMM (IPW-GMM). Several authors
have considered inverse probability weighting paired with a conditional in-
dependence assumption for estimation in the presence of missing information.
Recent examples include parametric IPW as in Robins, Mark and Newey 
(1992), Wooldridge (2002), Wooldridge (2003) and Tarozzi (2007), for miss- 
ing data models, and nonparametric inverse probability weighting as in 
Hirano, Imbens and Ridder (2003) for the case of mean treatment effect 
analysis. In this section, we extend existing results and first show that the 
optimally weighted IPW-GMM estimator of $\beta$ is semiparametrically efficient
when the propensity score is unknown. The same estimator, however, will 
be generally inefficient when the propensity score is known or belongs to 
a correctly specified parametric family; combinations of nonparametric and 
known or parametric estimated propensity scores are needed to achieve the 
semiparametric efficiency bounds for these cases. 

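4.1. Efficient estimation with unknown propensity score. The IPW-GMM
method uses the fact that under Assumption 2, moment condition (1) can
be rewritten as

(9)    $E[m(Z;\beta) \mid D = 1] = E\left[ m(Z;\beta)\frac{p(X)(1 - p)}{p\,(1 - p(X))} \,\middle|\, D = 0 \right]$,

while moment condition (2) is equivalent to

(10)   $E[m(Z;\beta)] = E\left[ m(Z;\beta)\frac{1 - p}{1 - p(X)} \,\middle|\, D = 0 \right]$.

Let $\hat p(X)$ be a consistent estimate of the true propensity score. Then we
can estimate $\beta_0$ defined by case (1) using GMM with the following sample
moment:

(11)   $\frac{1}{n_a}\sum_{j=1}^{n_a} m(Z_{aj};\beta)\frac{\hat p(X_{aj})}{1 - \hat p(X_{aj})}\frac{1 - \hat p}{\hat p}$,

and estimate $\beta_0$ defined by case (2) using GMM with the following sample
moment:

(12)   $\frac{1}{n_a}\sum_{j=1}^{n_a} m(Z_{aj};\beta)\frac{1 - \hat p}{1 - \hat p(X_{aj})}$.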

The inverse probability weighting approach is considered semiparametric
when $p(X)$ is estimated nonparametrically. In this case, it can be shown
that the sample moment (11) evaluated at $\beta_0$ and scaled by $\sqrt{n}$ is
asymptotically equivalent to

(13)   $\frac{1}{p}\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ (1 - D_i)\, m(Z_i;\beta_0)\frac{p(X_i)}{1 - p(X_i)} + \mathcal{E}(X_i;\beta_0)\frac{D_i - p(X_i)}{1 - p(X_i)} \right] + o_p(1)$.

The two components of this influence function are negatively correlated.
Because of this, the asymptotic variance might be smaller than that of the
estimator of $\beta_0$ based on moment condition (11) with the known $p(X)$.
Simple manipulations are sufficient to show that (13) is identical to the
influence function in (6) whose corresponding asymptotic variance $\Omega_\beta^1$ is
given in Theorem 1. An optimally weighted GMM estimator for $\beta_0$ defined
by case (1) using sample moment (11) then achieves the semiparametric
efficiency bound stated in Theorem 1.

The influence function representation for sample moment (12) can be
calculated as

$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ (1 - D_i)\, m(Z_i;\beta_0)\frac{1}{1 - p(X_i)} + \mathcal{E}(X_i;\beta_0)\frac{D_i - p(X_i)}{1 - p(X_i)} \right] + o_p(1)$,

whose two components are again negatively correlated. However, it is again
simple to show that this influence function representation is identical to the
one in (7). Hence, an optimally weighted GMM estimator for $\beta_0$ defined by
case (2) using sample moment (12) achieves the bound for case (2) stated
in Theorem 1.
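
To make the weighting in (11) and (12) concrete, the following sketch solves both sample moments for the scalar moment function $m(Z;\beta) = Y - \beta$. It is an illustration under assumptions that are not in the paper: $X$ is discrete, so the nonparametric propensity estimate is simply the cell frequency of $D = 1$; the estimate of $p$ is the sample fraction with $D = 1$; and the simulated design and the name `ipw_means` are hypothetical.

```python
import random


def ipw_means(d, x, y):
    """IPW estimates of E[Y | D = 1] (case 1) and E[Y] (case 2).

    Y is used only for observations with D = 0. The propensity score is
    estimated by the cell frequency of D = 1 within each value of X.
    """
    n = len(d)
    n_x, n1_x = {}, {}
    for di, xi in zip(d, x):
        n_x[xi] = n_x.get(xi, 0) + 1
        n1_x[xi] = n1_x.get(xi, 0) + di
    p_x = {k: n1_x[k] / n_x[k] for k in n_x}   # estimate of p(X)
    p_bar = sum(d) / n                         # estimate of p = Pr(D = 1)
    num1 = den1 = num2 = den2 = 0.0
    for di, xi, yi in zip(d, x, y):
        if di == 0:                            # only auxiliary observations carry Y
            w1 = p_x[xi] / (1.0 - p_x[xi]) * (1.0 - p_bar) / p_bar  # weight as in (11)
            w2 = (1.0 - p_bar) / (1.0 - p_x[xi])                    # weight as in (12)
            num1 += w1 * yi
            den1 += w1
            num2 += w2 * yi
            den2 += w2
    # Setting the weighted average of (Y - beta) to zero gives a weighted mean of Y.
    return num1 / den1, num2 / den2


rng = random.Random(3)
d, x, y = [], [], []
for _ in range(20000):
    xi = rng.choice([0, 1, 2])
    di = 1 if rng.random() < 0.2 + 0.3 * xi else 0
    yi = 2.0 * xi + rng.gauss(0.0, 1.0)
    d.append(di)
    x.append(xi)
    y.append(yi if di == 0 else None)          # Y is missing whenever D = 1

beta_case1, beta_case2 = ipw_means(d, x, y)
print(beta_case1, beta_case2)  # population values in this design: 2.8 and 2.0
```

The constant factors involving the estimate of $p$ cancel when the moment is solved for $\beta$ in this just-identified example; they are kept in the code only to mirror the form of the sample moments.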
In this subsection, to emphasize that the true propensity score function
is unknown and has to be estimated nonparametrically, we use $p_0(x) =
E(D \mid X = x)$ to indicate the true propensity score and $p(x)$ to denote
any candidate function. [Note that to save notation in the rest of the
main text $p(x)$ denotes the true propensity score.] Let $\hat p(\cdot)$ be a sieve es-
timator of $p_0(x)$ that uses the combined sample $\{(D_i, X_i) : i = 1, \dots, n\}$. Let
$\{Z_{ai} = (Y_{ai}, X_{ai}) : i = 1, \dots, n_a\}$ be the auxiliary (i.e., $D = 0$) data set. We
define the IPW-GMM estimator $\hat\beta$ for moment condition (1) as

(14)   $\hat\beta = \arg\min_{\beta \in B} \left( \frac{1}{n_a}\sum_{i=1}^{n_a} m(Z_{ai};\beta)\frac{\hat p(X_{ai})}{1 - \hat p(X_{ai})} \right)' \hat W \left( \frac{1}{n_a}\sum_{i=1}^{n_a} m(Z_{ai};\beta)\frac{\hat p(X_{ai})}{1 - \hat p(X_{ai})} \right)$

and the IPW-GMM estimator $\hat\beta$ for moment condition (2) as

(15)   $\hat\beta = \arg\min_{\beta \in B} \left( \frac{1}{n_a}\sum_{i=1}^{n_a} m(Z_{ai};\beta)\frac{1}{1 - \hat p(X_{ai})} \right)' \hat W \left( \frac{1}{n_a}\sum_{i=1}^{n_a} m(Z_{ai};\beta)\frac{1}{1 - \hat p(X_{ai})} \right)$.

There are two popular sieve nonparametric estimators of $p_0(\cdot)$.
(i) A sieve Least Squares (LS) estimator $\hat p_{ls}(x)$ as in Hahn (1998):

$\hat p_{ls} = \arg\min_{p(\cdot) \in \mathcal{H}_n} \frac{1}{n}\sum_{i=1}^{n} (D_i - p(X_i))^2 / 2$,

$\mathcal{H}_n = \left\{ h(x) = q^{k_n}(x)'\pi = \sum_{j=1}^{k_n} q_j(x)\pi_j \right\}$ for some known basis $(q_j)_{j=1}^{\infty}$.

In the Appendix we establish the consistency and convergence rate of $\hat p_{ls}(x)$
under the assumption that the variables in $X$ have unbounded support.

(ii) A sieve Maximum Likelihood (ML) estimator $\hat p_{mle}(x)$ as in
Hirano, Imbens and Ridder (2003):

$\hat p_{mle} = \arg\max_{p(\cdot) \in \mathcal{H}_n} \frac{1}{n}\sum_{i=1}^{n} \{ D_i \log[p(X_i)] + (1 - D_i)\log[1 - p(X_i)] \}$,

$\mathcal{H}_n = \{ h(x) = [A^{k_n}(x)'\pi]^2 \}$ or $\{ h(x) = \exp(A^{k_n}(x)'\pi) \}$.
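
As an illustration of a sieve ML first step, the sketch below fits a series logit, maximizing the Bernoulli log-likelihood displayed above over functions of the form $p(x) = 1/(1 + \exp(-q^{k}(x)'\pi))$. The logit link, the power-series basis, the fixed `k`, the Newton-Raphson solver and the simulated design are all assumptions made here for concreteness; they are not necessarily the sieve space used in the paper.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        sol[i] = (m[i][n] - sum(m[i][j] * sol[j] for j in range(i + 1, n))) / m[i][i]
    return sol


def sieve_logit(d, x, k, iters=25):
    """Series logit: Newton-Raphson on the Bernoulli log-likelihood with basis (1, x, ..., x^(k-1))."""
    q = [[xi ** j for j in range(k)] for xi in x]
    pi = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for qi, di in zip(q, d):
            p = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(pi, qi))))
            w = p * (1.0 - p)
            for r in range(k):
                grad[r] += (di - p) * qi[r]          # score of the log-likelihood
                for c in range(k):
                    hess[r][c] += w * qi[r] * qi[c]  # negative Hessian
        step = solve(hess, grad)
        pi = [a + s for a, s in zip(pi, step)]
    return lambda t: 1.0 / (1.0 + math.exp(-sum(a * t ** j for j, a in enumerate(pi))))


rng = random.Random(2)
x = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
d = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(0.5 + xi))) else 0 for xi in x]

p_hat = sieve_logit(d, x, k=3)
print(p_hat(0.0))  # the true propensity score at x = 0 is 1 / (1 + exp(-0.5)), about 0.62
```

The fitted function `p_hat` could then be plugged into the weights of (14) or (15); in practice the number of basis terms would grow with the sample size rather than stay fixed.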

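Before we present the large sample properties of the IPW-GMM estima-
tor, we need to introduce some notation and assumptions. Let the sup-
port of $X$ be $\mathcal{X} = \mathbb{R}^{d_x}$. We could use more complicated notation and let
$\mathcal{X} = \mathcal{X}_c \times \mathcal{X}_{dc}$, with $\mathcal{X}_c$ being the support of the continuous variables and
$\mathcal{X}_{dc}$ the support of the finitely many discrete variables. Further we could
decompose $\mathcal{X}_c = \mathcal{X}_{c1} \times \mathcal{X}_{c2}$ with $\mathcal{X}_{c1} = \mathbb{R}^{d_{x,1}}$ and $\mathcal{X}_{c2}$ being a compact and
connected subset of $\mathbb{R}^{d_{x,2}}$. Then, under simple and usual modifications of the
assumptions, the large sample results stated below would remain valid. To
avoid tedious notation yet to allow for some unbounded support elements of
$X$, we assume $\mathcal{X} = \mathcal{X}_c = \mathbb{R}^{d_x}$. For any $1 \times d_x$ vector $a = (a_1, \dots, a_{d_x})$ of non-
negative integers, we write $|a| = \sum_{k=1}^{d_x} a_k$, and for any $x = (x_1, \dots, x_{d_x})' \in \mathcal{X}$,
we denote the $|a|$th derivative of a function $h: \mathcal{X} \to \mathbb{R}$ as

$\nabla^a h(x) = \frac{\partial^{|a|}}{\partial x_1^{a_1} \cdots \partial x_{d_x}^{a_{d_x}}} h(x)$.

For some $\gamma > 0$, let $\underline{\gamma}$ be the largest integer smaller than $\gamma$, and let $\Lambda^\gamma(\mathcal{X})$
denote a Hölder space with smoothness $\gamma$, that is, a space of functions
$h: \mathcal{X} \to \mathbb{R}$ which have up to $\underline{\gamma}$ continuous derivatives, and the highest ($\underline{\gamma}$th)
derivatives are Hölder continuous with the Hölder exponent $\gamma - \underline{\gamma} \in (0, 1]$.
The Hölder space becomes a Banach space when endowed with the Hölder
norm:

$\|h\|_{\Lambda^\gamma} = \sup_x |h(x)| + \max_{|a| = \underline{\gamma}} \sup_{x \ne \bar x} \frac{|\nabla^a h(x) - \nabla^a h(\bar x)|}{((x - \bar x)'(x - \bar x))^{(\gamma - \underline{\gamma})/2}} < \infty$.

We call $\Lambda_c^\gamma(\mathcal{X}) = \{h \in \Lambda^\gamma(\mathcal{X}) : \|h\|_{\Lambda^\gamma} \le c < \infty\}$ a Hölder ball (with radius $c$).

Define a weighted sup-norm $\|g\|_{\infty,\omega} = \sup_{x \in \mathcal{X}} |g(x)[1 + |x|^2]^{-\omega/2}|$ for some
$\omega > 0$. Denote by $\Pi_{\infty n} g$ the projection of $g$ onto the sieve space $\mathcal{H}_n$ under
the norm $\|\cdot\|_{\infty,\omega}$. Let $f_{X_a}(x) = f_{X|D=0}(x)$ and $E_a(\cdot) = E(\cdot \mid D = 0)$.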

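Assumption 3. Let $\hat W - W = o_p(1)$ for a positive semidefinite ma-
trix $W$, and the following hold: (1) $p_0(\cdot)$ belongs to a Hölder ball
$\mathcal{H} = \{p(\cdot) \in \Lambda_c^\gamma(\mathcal{X}) : 0 < \underline{p} \le p(x) \le \bar{p} < 1\}$ for some $\gamma > 0$;
(2) $\int (1 + |x|^2)^{\omega} f_X(x)\, dx < \infty$ for some $\omega > 0$;
(3) there is a function $b(\cdot)$ such that $b(\delta) \to 0$ as $\delta \to 0$ and
$E_a[\sup_{\|\tilde\beta - \beta\| < \delta} \|m(Z_i;\tilde\beta) - m(Z_i;\beta)\|^2] \le b(\delta)$ for all small positive values $\delta$;
(4) $E_a[\sup_{\beta \in B} \|m(Z_i;\beta)\|^2] < \infty$; (5) for any $h \in \mathcal{H}$, there is a sequence
$\Pi_{\infty n} h \in \mathcal{H}_n$ such that $\|h - \Pi_{\infty n} h\|_{\infty,\omega} = o(1)$.

Theorem 6. Let $\hat\beta$ be the IPW-GMM estimator given in (14) or (15).
Under Assumptions 1, 2 and 3, if $k_n/n \to 0$, $k_n \to \infty$, then $\hat\beta - \beta_0 = o_p(1)$.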



Let $E(\cdot) = \int (\cdot) f_X(x)\, dx$, $\|h\|_2 = \sqrt{\int h(x)^2 f_X(x)\, dx}$, and $\Pi_{2n} h$ be the pro-
jection of $h$ onto the closed linear span of $q^{k_n}(x) = (q_1(x), \dots, q_{k_n}(x))'$ under
the norm $\|\cdot\|_2$. We need the following additional assumptions to obtain
asymptotic normality.
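Assumption 4. Let $\beta_0 \in \mathrm{int}(B)$, $E\left[\frac{1}{1 - p_0(X)}\mathcal{E}(X;\beta_0)\mathcal{E}(X;\beta_0)'\right]$ be posi-
tive definite, and the following hold: (1) Assumptions 3.1 and 3.2 are sat-
isfied with $\gamma > d_x/2$ and $\omega > \gamma$; (2) there exist a constant $\epsilon \in (0, 1]$ and a
small $\delta_0 > 0$ such that

$E_a\left[ \sup_{\|\tilde\beta - \beta\| < \delta} \|m(Z_i;\tilde\beta) - m(Z_i;\beta)\|^2 \right] \le \mathrm{const.}\, \delta^{\epsilon}$

for any small positive value $\delta \le \delta_0$; (3) $E_a[\sup_{\beta \in B: \|\beta - \beta_0\| \le \delta_0} \|m(Z_i;\beta)\|^2 (1 +
|X_i|^2)^{\omega}] < \infty$ for some small $\delta_0 > 0$; (4) $E[\|\frac{\partial \mathcal{E}(X;\beta_0)}{\partial\beta}\|(1 + |X|^2)^{\omega/2}] < \infty$,
and for all $x \in \mathcal{X}$, $\frac{\partial \mathcal{E}(x;\beta)}{\partial\beta}$ is continuous around $\beta_0$; (5) $k_n = O(n^{d_x/(2\gamma + d_x)})$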

sup^gg. . ^„^P . <5jj sup^g^:^' ||<5(x,/3)|| < const. < 00 for some small 5o > 0; or 
(6b) £^a[sup^g5.||^_^j,||<5^ < const. < 00 for some small 5o > 0, 

and fxA-) e A?(;f) with 7 > 3d,/4; or (6c) Ea[snpp^B:\\p-M<So 1 < 

const. < 00 for some small > 0, and fxai') ^ with j > dx- 

Theorem 7. Let $\hat\beta$ be the IPW-GMM estimator given in (14) or (15).
Under Assumptions 1, 2, 3 and 4, we have $\sqrt{n}(\hat\beta - \beta_0) \Rightarrow \mathcal{N}(0, V)$, with $V$
the same as in Theorem 4.

Remark 3. (i) The weighting $\omega$ is needed since the support of the con-
ditioning variable $X$ is assumed to be the entire Euclidean space. When $X$
has bounded support and $f_{X|D=0}$ is bounded above and below over its sup-
port, we can simply set $\omega = 0$ in Assumptions 3 and 4 and replace 4.1 with
the assumption that 3.1 holds with $\gamma > d_x/2$. Note that Assumption 4.6a is
easily satisfied when $X$ has compact support. When $\mathcal{X} = \mathbb{R}^{d_x}$, Assumption
4.6a rules out $\mathcal{E}(x,\beta)$ being linear in $x$; Assumptions 4.6b or 4.6c allow for
linear $\mathcal{E}(x,\beta)$ but need smoother propensity score $p(x)$ and density $f_{X|D=0}$.

(ii) Assumptions 3 and 4 again allow for non-smooth moment conditions.

(iii) Since $\frac{f_{X_a}(x)}{f_X(x)} = \frac{1 - p_0(x)}{1 - p}$, the assumption $0 < \underline{p} \le p_0(x) \le \bar{p} < 1$ im-
plies that $0 < \frac{1 - \bar{p}}{1 - p} \le \frac{f_{X_a}(x)}{f_X(x)} \le \frac{1 - \underline{p}}{1 - p}$; hence $E(\cdot)$ and $E_a(\cdot)$ in Assumptions 3
and 4 are effectively equivalent. (iv) Although Assumption 3.1 imposes the
same strong condition $0 < \underline{p} \le p_0(x) \le \bar{p} < 1$ as that typically assumed in
the program evaluation literature, unlike most existing papers on estima-
tion of average treatment effects, our paper allows for unbounded support
of $X$ and assumes weaker smoothness on $p_0(x)$ and $\mathcal{E}(\cdot;\beta_0)$. In particular,

if we let $k_n = O(n^{d_x/(2\gamma + d_x)})$, the growth order which leads to the optimal con-
vergence rate of $\|\hat p(\cdot) - p_0(\cdot)\|_2 = O_p(n^{-\gamma/(2\gamma + d_x)})$, then Assumption 4.5 is
satisfied with ll^i^ -n2„|i^||2 = o(n-'^-/(2(27+rf-))) = o(fc-'/'). 

4.2. IPW estimation with parametric or known propensity score. The
case of moment condition (2) is simpler, and therefore, we briefly discuss it
first. Theorems 1 and 2 have shown that knowledge about the propensity
score does not change the semiparametric efficiency bound. Furthermore,
Theorems 4 and 7 show that both a nonparametric CEP-GMM estimator
and a nonparametric IPW-GMM estimator for $\beta$ achieve this semiparametric
efficiency bound regardless of whether the propensity score is unknown,
known or parametrically specified. The following theorem also states, without
proof, the interesting result that the parametric IPW estimator using
$p(X;\hat\gamma)$ is in fact less efficient than the one using a nonparametric estimate
$\hat p(X)$ in (12), but is more efficient than the one using the known $p(X)$.

Theorem 8. Suppose that $E[S_\gamma(D,X)S_\gamma(D,X)']$ is positive definite
and that the parametric model $p(X_i;\gamma)$ is correctly specified. Under moment
condition (2) and using the optimally weighted sample moment condition
(12), an IPW-GMM estimator for $\beta$ using a parametric estimate $p(X_i;\hat\gamma)$
in place of $\hat p(X_i)$ in (12) is more efficient than the one using the known
$p(X_i)$, but is less efficient than the one using a nonparametric estimate $\hat p(X_i)$
of the propensity score.

This result is based on the following relations, which hold asymptotically: 
f^-^ l-p{Xi;-f)J V^"i^i ^-p{Xi)J 
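
The efficiency ordering in Theorem 8 can be illustrated with a small Monte Carlo experiment. The sketch below is not taken from the paper: it assumes a scalar mean parameter $\beta = E[Y]$, a proxy $X$ with four support points, a correctly specified logit for $P(D=1|X)$, and the simple IPW sample moment $n^{-1}\sum_i (1-D_i)Y_i/(1-\tilde p(X_i))$, where $Y$ is observed only when $D=0$. The design, the function names and all tuning constants are hypothetical choices made for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(x, d, iters=25):
    """Newton-Raphson MLE for P(D=1|X=x) = logistic(g0 + g1*x)."""
    W = np.column_stack([np.ones_like(x), x])
    g = np.zeros(2)
    for _ in range(iters):
        p = logistic(W @ g)
        grad = W.T @ (d - p)                          # score of the logit likelihood
        hess = (W * (p * (1.0 - p))[:, None]).T @ W   # information matrix
        g = g + np.linalg.solve(hess, grad)
    return g

def one_draw(n, rng):
    """One sample; returns IPW estimates of E[Y] with known, logit and cell-frequency p."""
    x = rng.integers(0, 4, size=n).astype(float)      # proxy with support {0,1,2,3}
    p0 = logistic(-1.0 + 0.5 * x)                     # true propensity (assumed design)
    d = (rng.random(n) < p0).astype(float)            # D=1: Y is missing
    y = x ** 2 + rng.standard_normal(n)               # Y is used only where D=0
    seen = 1.0 - d
    g = fit_logit(x, d)
    p_par = logistic(g[0] + g[1] * x)                 # parametric estimate
    p_np = np.empty(n)
    for v in range(4):                                # nonparametric: cell frequencies
        cell = x == v
        p_np[cell] = d[cell].mean()
    return [np.mean(seen * y / (1.0 - p)) for p in (p0, p_par, p_np)]

def scaled_variances(reps=1000, n=400, seed=0):
    """n times the Monte Carlo variance of each estimator (known, parametric, nonparametric)."""
    rng = np.random.default_rng(seed)
    draws = np.array([one_draw(n, rng) for _ in range(reps)])
    return n * draws.var(axis=0)

v_known, v_par, v_np = scaled_variances()
print("known:", v_known, "parametric:", v_par, "nonparametric:", v_np)
```

In this design the conditional mean of $Y$ is quadratic in $X$ while the logit index is linear, so the parametric first step removes only part of the noise that the cell-frequency estimate removes; this is why the three variances are expected to be ordered as in the theorem.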




Now consider the more interesting case where moment condition (1) holds
and sample moment condition (11) is used. Consider the case when the
parametric propensity score is correctly specified. First, it is clear that the
optimally weighted IPW-GMM estimator of $\beta$ based on (11) that uses a
nonparametric estimate of $p(X)$ does not achieve the efficiency bound in
Theorem 3, because we see from Theorem 7 that this estimator achieves
instead the variance bound in Theorem 1, which is larger than the variance
bound in Theorem 3.

However, the parametric two step IPW estimator that uses a parametric
first step for $p(X;\gamma)$ does not achieve the efficiency bound in Theorem 3
either. To see this, note that the parametric two step IPW estimator is
based on the moment condition



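$$\frac{1}{n}\sum_{i=1}^{n} m(Z_i;\beta)\,\frac{(1-D_i)\,p(X_i;\hat\gamma)}{\hat p\,(1-p(X_i;\hat\gamma))},$$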
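which has a linear influence function representation of

$$\frac{1}{p}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[ m(Z_i;\beta_0)\frac{(1-D_i)\,p(X_i)}{1-p(X_i)} + \mathrm{Proj}\left(\mathcal{E}(X_i;\beta_0)\frac{D_i-p(X_i)}{1-p(X_i)} \,\Big|\, S_\gamma(D_i,X_i)\right)\right],$$

where

$$\mathrm{Proj}\left(\mathcal{E}(X_i;\beta_0)\frac{D_i-p(X_i)}{1-p(X_i)} \,\Big|\, S_\gamma(D_i,X_i)\right)
= E\left[\mathcal{E}(X;\beta_0)\frac{D_i-p(X_i)}{1-p(X_i)}S_\gamma(D_i,X_i)'\right] E[S_\gamma(D_i,X_i)S_\gamma(D_i,X_i)']^{-1}S_\gamma(D_i,X_i)$$
$$= E\left[\mathcal{E}(X;\beta_0)\frac{p_\gamma(X)'}{1-p(X)}\right] \times E[S_\gamma(D_i,X_i)S_\gamma(D_i,X_i)']^{-1}S_\gamma(D_i,X_i)$$

is the influence function from the first step estimation of $\gamma$. The difference
between this influence function and that in Theorem 3 can be verified to be
equal to

$$\frac{1}{p}\,\mathrm{Res}\left(-(D-p(X))\frac{p(X)}{1-p(X)}\,\mathcal{E}(X;\beta_0) \,\Big|\, S_\gamma(D_i,X_i)\right),$$

which is obviously orthogonal to the influence function of Theorem 3. Therefore,
the two step parametric IPW estimator has a variance larger than the
efficiency bound under the assumption of correct specification of the
parametric model for $p(X;\gamma)$.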

An IPW type estimator that achieves the efficiency bound under correct
specification can be obtained by combining both nonparametric and
parametric estimates of the propensity score. Such an efficient moment condition
is given by

(16)   $$\frac{1}{n}\sum_{i=1}^{n} m(Z_i;\beta)\,\frac{(1-D_i)\,p(X_i;\hat\gamma)}{\hat p\,(1-\hat p(X_i))},$$

where $\hat\gamma$ is the maximum likelihood estimator for $\gamma_0$ and $\hat p(X)$ is the sieve
estimate of the propensity score. This moment condition has the following
asymptotic linear representation:

$$\frac{1}{p}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[(1-D_i)\big(m(Z_i;\beta_0)-\mathcal{E}(X_i;\beta_0)\big)\frac{p(X_i)}{1-p(X_i)} + p(X_i)\,\mathcal{E}(X_i;\beta_0)\right]
+ E\left[\frac{\mathcal{E}(X;\beta_0)}{p}\,p_\gamma(X)'\right]\sqrt{n}(\hat\gamma-\gamma_0),$$

which is identical to the influence function under correct parametric
specification $p(X;\gamma)$ leading to the semiparametric efficiency bound in Theorem
3.

The case where the propensity score is fully known can be considered a
special case of parametric propensity score where the parameters are known.
In this case, the efficient moment condition is as in (16) after replacing
$p(X_i;\hat\gamma)$ with the known $p(X_i)$.

It is finally worth noting that Assumption 2 is an identification assump- 
tion that is not testable. Therefore, both the CEP-GMM estimator and the 
IPW-GMM estimator will converge to the same population limit regard- 
less of whether Assumption 2 holds, as long as the same weighting matrix 
is being used. The population difference between CEP and IPW can only 
arise from the parametric mis-specification of the approximating models for
$\mathcal{E}(X;\beta)$ and $p(X)$.

5. Empirical illustration. We illustrate our method empirically using 
data from the Indian National Sample Survey (NSS), which is used to mon- 
itor changes in the distribution of private consumption in India. Several 
researchers have argued that changes in survey methodology caused non- 
comparability between poverty estimates calculated for 1999-2000 and those 
from previous years. Changes in the questionnaire likely led to the overesti- 
mation of food consumption, and hence to the underestimation of poverty 
[Deaton and Kozel (2005), Tarozzi (2007)]. In other words, a missing data 
problem arises because the variable of interest (expenditure recorded using 
the "standard" questionnaire) is not observed. Deaton and Dreze (2002), 
Deaton (2003) and Tarozzi (2007) argue that expenditure in a set of miscel- 
laneous items for which the questionnaire was not modified ("comparable 
items" hereafter) can be used as a proxy variable to produce an estimate of 
poverty for 1999-2000 that is comparable with previous years. 

We assume that the object of interest is the cumulative distribution
function (c.d.f.) for rural India in 1999-2000 of a measure of total monthly
expenditure that is comparable with previous NSS rounds. In the terminology
used in this article, this situation corresponds to a verify-out-of-sample case
(1), where the parameter of interest $\beta$ is identified in terms of a variable
$Y$ that is not observed in the primary sample (the 1999-2000 survey). The
moment function takes the form $m(Z;\beta_0) = 1(Y \le y) - \beta_0$, where $y$ is a
given threshold. We use the previous round of the NSS (1993-94) as
auxiliary survey, and expenditure in "comparable items" as proxy variable $X$.
The crucial identifying assumption is that the distribution of $Y$ conditional
on $X$ remained stable between 1993-94 and 1999-2000 [Tarozzi (2007)].
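
The way the two adjustments recover a c.d.f. for the primary sample can be sketched on simulated data. The example below does not use the NSS data and is not the authors' implementation: it assumes a hypothetical design in which $Y|X$ is the same in both samples but the proxy $X$ is shifted in the primary sample, a cubic polynomial sieve for the CEP step, a linear-index logit for the propensity score, and normalized IPW weights. All names and constants are illustrative.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(W, d, iters=25):
    """Newton-Raphson logit MLE of d on the columns of W."""
    g = np.zeros(W.shape[1])
    for _ in range(iters):
        p = logistic(W @ g)
        g = g + np.linalg.solve((W * (p * (1.0 - p))[:, None]).T @ W, W.T @ (d - p))
    return g

def cubic_basis(x):
    """Hypothetical sieve basis: polynomial of order 3 in the proxy."""
    return np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

rng = np.random.default_rng(1)
n = 20000
x_aux = rng.standard_normal(n)                 # auxiliary sample (D=0): X and Y observed
y_aux = x_aux + rng.standard_normal(n)
x_pri = 0.5 + rng.standard_normal(n)           # primary sample (D=1): only X observed
y0 = 0.5                                       # threshold; true P(Y <= y0 | D=1) is 0.5 here

ind = (y_aux <= y0).astype(float)              # 1(Y <= y) in the auxiliary sample

# Unadjusted: c.d.f. of Y in the auxiliary sample, ignoring the shift in X.
naive = ind.mean()

# CEP: project 1(Y <= y) on the sieve in the auxiliary sample,
# then average the fitted conditional expectation over the primary sample.
coef, *_ = np.linalg.lstsq(cubic_basis(x_aux), ind, rcond=None)
cep = (cubic_basis(x_pri) @ coef).mean()

# IPW: reweight auxiliary observations by p(X)/(1-p(X)), p(X) = P(D=1|X).
x_all = np.concatenate([x_pri, x_aux])
d_all = np.concatenate([np.ones(n), np.zeros(n)])
g = fit_logit(np.column_stack([np.ones_like(x_all), x_all]), d_all)
p_aux = logistic(g[0] + g[1] * x_aux)
w = p_aux / (1.0 - p_aux)
ipw = np.sum(w * ind) / np.sum(w)              # normalized weights (an illustrative choice)

print("unadjusted:", naive, "CEP:", cep, "IPW:", ipw)
```

Because $X$ is normal with a common variance in both simulated samples, the log odds of $D$ are linear in $X$, so the logit used here is correctly specified by construction; with real data a flexible sieve-logit such as the one described for column 3 would be the analogue.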

Table 1 reports point estimates and standard errors for the c.d.f. at se- 
lected thresholds. The first column reports estimates using the noncompara- 
ble data from the primary sample. Column 2 reports CEP-GMM estimates, 
calculated using 3rd order polynomial splines in expenditure in comparable 
items as sieve basis, with 10 knots at the equal range quantiles of the empiri- 
cal distribution of the proxy variables. Column 3 reports estimates obtained 
using moment condition (9), but with a nonparametric first step where we 
estimate P{X) using sieve-logit, including the basis functions we used for 

Table 1

Cumulative distribution functions (×100) of total (log) household expenditure

          (1)             (2)               (3)               (4)              (5)
  y       Unadjusted      Adjusted          Adjusted          Adjusted         Adjusted
          (primary)       NP CEP            NP IPW            Par. IPW         Eff. CEP

  6        2.92 (0.067)    3.388 (0.0695)    3.387 (0.0694)    3.15 (0.0594)    3.23 (0.0598)
  6.25     5.67 (0.092)    6.521 (0.0948)    6.522 (0.0948)    6.31 (0.0846)    6.38 (0.0845)
  6.50    11.06 (0.125)   12.272 (0.1237)   12.273 (0.1234)   12.21 (0.1165)   12.21 (0.1149)
  6.75    20.28 (0.161)   21.679 (0.1588)   21.674 (0.1587)   21.89 (0.1645)   21.76 (0.1575)
  7       34.06 (0.189)   35.052 (0.1763)   35.041 (0.1772)   35.53 (0.1794)   35.28 (0.1738)
  7.25    50.75 (0.200)   50.600 (0.1967)   50.592 (0.1975)   51.19 (0.1948)   50.88 (0.1920)
  7.50    66.98 (0.188)   65.682 (0.1925)   65.687 (0.1929)   66.15 (0.1973)   65.91 (0.1880)

Source: Authors' calculations from Indian National Sample Survey, rounds 50 (1993-94,
n = 58,846) and 55 (1999-2000, n = 62,679), rural sector only from the major Indian
states, which account for more than 95% of the total population. Standard errors in
parentheses. Column (1): calculated from the unadjusted primary sample. Column (2):
CEP-GMM cubic sieve estimator, with 10 knots, using "comparable items" as predictor.
Column (3): IPW-GMM, flexible logit with cubic sieve, with 10 knots, using "comparable
items" as predictor. Column (4): parametric IPW estimator; the propensity score is
estimated using logit and including total expenditure in "comparable items" as sole
predictor. Column (5): semiparametric estimator efficient for the case of correctly
specified propensity score.



CEP-GMM as regressors. In column 4 we impose a parametric model, and 
we estimate the propensity score using logit, with X entered linearly in the 
single index. Column 5 reports the results for the estimator described in 
Section 3.2, which is efficient when a parametric model is correctly specified 
for P{X). 

For values of Y below 7 the adjusted estimates of the cdf in columns 2 to 4 
are slightly larger than the unadjusted figures in column 1. As expected, CEP 
and IPW non-parametric estimators produce virtually identical results. The 
estimates in columns 4 and 5 impose a simple logit for the propensity score, 
but they are still very similar. In the verify-out-of-sample case, knowledge of 
a parametric form for P{X) lowers the semiparametric efficiency bound, and 
this may explain why in some cases the standard errors in column 4 are lower 
than those in columns 2 and 3, where the estimator is only efficient when 
P{X) is unknown. Note also that when the parametric assumption is correct 
the efficient estimator is the one in column 5. Indeed the standard errors for 
this estimator are always lower or virtually identical to those in column 
4 every time this latter estimator is more precise than the nonparametric 
estimators in columns 2 and 3. 

6. Conclusions. We derive semiparametric efficiency bounds for the esti- 
mation of parameters defined through general nonlinear, possibly nonsmooth 
and over-identified moment conditions, when variables in the primary sample
of interest are missing. For identification we rely on the validity of a
conditional independence assumption and on the availability of an auxiliary
sample that contains information on the relation between missing variables
and other proxy variables that are also observed in the primary sample. We
study two alternative frameworks. In the first case ("verify-out-of-sample")
validation is done with an auxiliary data set which is independent from the
primary data set of interest. In the second case ("verify-in-sample") a subset
of the observations in the primary sample is validated.

We show that the optimally weighted CEP-GMM estimators achieve the 
semiparametric efficiency bounds when the propensity score is unknown, or 
is known or belongs to a correctly specified parametric family. These estima- 
tors only use a nonparametric estimate of the conditional expectation of the 
moment functions, and their asymptotic efficiency is obtained under regular- 
ity conditions weaker than the existing ones in the literature. In particular, 
these CEP-GMM estimators still achieve efficiency bounds when proxy (con- 
ditioning) variables have unbounded supports and moment conditions are 
not smooth. 

We also prove that an optimally weighted IPW-GMM estimator is semi- 
parametrically efficient with fully unknown propensity score. However, this 
estimator is not efficient when the propensity score is either known, or is 
parametrically estimated using a correctly specified parametric model; in 
such instances, appropriate combinations of nonparametric and parametric 
estimates of the propensity score are needed to achieve the efficiency bounds. 

We have also demonstrated that, from the theoretical point of view, the 
CEP-GMM estimators are more attractive than the IPW-GMM estimators. 
Recently and independently Imbens, Newey and Ridder (2005) advocated a 
similar sieve conditional expectation projection based estimator for the av- 
erage treatment effect parameter in program evaluation applications. Also, 
for the estimation of the average treatment effects in missing data models, 
Wang, Linton and Hardle (2004) suggested that a semiparametrically spec- 
ified propensity score, such as a single index or a partially linear form, can 
be used to reduce the curse of dimensionality in the nonparametric estima- 
tion of the propensity score. An interesting topic for future research is to 
study the efficiency implications of these semiparametric restrictions on the 
propensity score. 

APPENDIX A: CALCULATION OF EFFICIENCY BOUNDS 



Proof of Theorem 1. We follow closely the structure of semiparametric
efficiency bound derivation of Newey (1990) and Bickel, Klaassen, Ritov and Wellner
(1993).

Case (1). Consider a parametric path $\theta$ for the joint distribution of $Y$, $D$
and $X$. Define $p_\theta = P_\theta(D = 1)$. The joint density function for $Y$, $D$ and $X$ is
given by

(17)   $f_\theta(y,x,d) = p_\theta^{d}(1-p_\theta)^{1-d} f_\theta(x \mid D=1)^{d} f_\theta(x \mid D=0)^{1-d} f_\theta(y \mid x)^{1-d}.$

The resulting score function is given by

$$S_\theta(d,y,x) = \frac{d-p_\theta}{p_\theta(1-p_\theta)}\,\dot p_\theta + (1-d)\,s_\theta(x \mid D=0) + d\,s_\theta(x \mid D=1) + (1-d)\,s_\theta(y \mid x),$$

where $s_\theta(y \mid x) = \frac{\partial}{\partial\theta}\log f_\theta(y \mid x)$, $\dot p_\theta = \frac{\partial}{\partial\theta}p_\theta$, $s_\theta(x \mid d) = \frac{\partial}{\partial\theta}\log f_\theta(x \mid d)$. The
tangent space of this model is therefore given by:

(18)   $\mathcal{T} = \{a(d-p_\theta) + (1-d)s_\theta(x \mid D=0) + (1-d)s_\theta(y \mid x) + d\,s_\theta(x \mid D=1)\},$

where $\int s_\theta(y \mid x)f_\theta(y \mid x)\,dy = 0$, $\int s_\theta(x \mid d)f_\theta(x \mid d)\,dx = 0$, and $a$ is a finite
constant.

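Consider first the case when the model is exactly identified. In this case
$\beta$ is uniquely identified by condition (1). Differentiating under the integral
gives

(19)   $$\frac{\partial\beta(\theta)}{\partial\theta} = -(\mathcal{J}^1_\beta)^{-1} E\left[m(Z;\beta)\,\frac{\partial\log f_\theta(Y,X \mid D=1)}{\partial\theta'} \,\Big|\, D=1\right].$$

The second component of the right-hand side of this expression can be
calculated as

(20)   $E[m(Z;\beta)s_\theta(Y \mid X)' \mid D=1] + E[m(Z;\beta)s_\theta(X \mid D=1)' \mid D=1].$

Pathwise differentiability follows if we can find $\Psi^1(Y,X,D) \in \mathcal{T}$ such that

(21)   $\partial\beta(\theta)/\partial\theta = E[\Psi^1(Y,X,D)S_\theta(Y,X,D)'].$

Define $p_\theta = \int p_\theta(x)f_\theta(x)\,dx$, $\mathcal{E}_\theta(X) = E[m(Z;\beta) \mid X]$. It can be verified that
pathwise differentiability is satisfied by choosing $\Psi^1(Y,X,D) = -(\mathcal{J}^1_\beta)^{-1} \times F^1_\beta(Y,X,D)$,
where

(22)   $$F^1_\beta(Y,X,D) = \frac{1-D}{p}\,\frac{p(X)}{1-p(X)}\,[m(Z;\beta) - \mathcal{E}(X)] + \frac{\mathcal{E}(X)}{p}\,D.$$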

Since $\mathcal{J}^1_\beta$ is a nonsingular transformation, this can be shown by proving that

(23)   $$E\left[m(Z;\beta)\,\frac{\partial}{\partial\theta'}\log f_\theta(Y,X \mid D=1) \,\Big|\, D=1\right] = E[F^1_\beta(Y,X,D)S_\theta(Y,X,D)'].$$

This can in turn be verified by checking that

$$E[m(Z;\beta)s_\theta(Y \mid X)' \mid D=1] = E\left[\frac{1-D}{p}\,\frac{p(X)}{1-p(X)}\,[m(Z;\beta)-\mathcal{E}(X)]\,s_\theta(Y \mid X)'\right],$$
$$E[m(Z;\beta)s_\theta(X \mid D=1)' \mid D=1] = E\left[\frac{\mathcal{E}(X)}{p}\,D\,s_\theta(X \mid D=1)'\right].$$

Now one can also verify that $F^1_\beta(Y,X,D)$ belongs to the tangent space
$\mathcal{T}$ in equation (18), with the first and second terms of $F^1_\beta(Y,X,D)$ taking
the role of $(1-d)s_\theta(y \mid x)$ and $d\,s_\theta(x \mid D=1)$, respectively, and the two other
components in (18) being identically equal to 0.

Therefore all the conditions of Theorem 3.1 in Newey (1990) hold, so
that $\Psi^1$ is the efficient score function and the efficiency bound for regular
estimators of the parameter $\beta$ is given by

(24)   $V_1 = (\mathcal{J}^1_\beta)^{-1}E[F^1_\beta(Y,X,D)F^1_\beta(Y,X,D)'](\mathcal{J}^1_\beta)'^{-1} = (\mathcal{J}^1_\beta)^{-1}\Omega^1_\beta(\mathcal{J}^1_\beta)'^{-1}.$

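Case (2). For this case we use an alternative factorization of the likelihood
function. Define $p_\theta(x) = P_\theta(D = 1 \mid x)$. The joint density function for $Y$, $D$
and $X$ is given by

(25)   $f_\theta(y,x,d) = f_\theta(x)\,p_\theta(x)^{d}[1-p_\theta(x)]^{1-d} f_\theta(y \mid x)^{1-d}.$

The resulting score function is then given by

$$S_\theta(d,y,x) = (1-d)\,s_\theta(y \mid x) + \frac{d-p_\theta(x)}{p_\theta(x)(1-p_\theta(x))}\,\dot p_\theta(x) + t_\theta(x),$$

where

$$s_\theta(y \mid x) = \frac{\partial}{\partial\theta}\log f_\theta(y \mid x), \qquad \dot p_\theta(x) = \frac{\partial}{\partial\theta}p_\theta(x), \qquad t_\theta(x) = \frac{\partial}{\partial\theta}\log f_\theta(x).$$

The tangent space of this model is therefore given by:

(26)   $\mathcal{T} = \{(1-d)s_\theta(y \mid x) + a(x)(d-p_\theta(x)) + t_\theta(x)\},$

where $\int s_\theta(y \mid x)f_\theta(y \mid x)\,dy = 0$, $\int t_\theta(x)f_\theta(x)\,dx = 0$, and $a(x)$ is any square
integrable function.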

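In case (2), equation (19) is replaced by

(27)   $$\frac{\partial\beta(\theta)}{\partial\theta} = -(\mathcal{J}^2_\beta)^{-1}E\left[m(Z;\beta)\,\frac{\partial\log f_\theta(Y,X)}{\partial\theta'}\right]
= -(\mathcal{J}^2_\beta)^{-1}\{E[m(Z;\beta)s_\theta(Y \mid X)'] + E[\mathcal{E}(X)t_\theta(X)']\}.$$

Now we replace $F^1_\beta(Y,X,D)$ in (22) with the following:

(28)   $$F^2_\beta(Y,X,D) = \frac{1-D}{1-p(X)}\,[m(Z;\beta)-\mathcal{E}(X)] + \mathcal{E}(X),$$

and then it can be verified that $E[F^2_\beta(Y,X,D)S_\theta(Y,X,D)'] = E[m(Z;\beta)\,\partial\log f_\theta(Y,X)/\partial\theta']$.
Then the efficient influence function for case (2) is equal to
$-(\mathcal{J}^2_\beta)^{-1}F^2_\beta(Y,X,D)$ with the two terms being orthogonal to each other,
and the second result in Theorem 1 follows.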

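Now consider over identified moment conditions. We only consider case
(1), as the derivation for case (2) is analogous. When $d_m > d_\beta$, the moment
conditions in (1) are equivalent to the requirement that for any matrix $A$
of dimension $d_\beta \times d_m$ the following exactly identified system of moment
conditions holds: $A\,E[m(Z;\beta) \mid D=1] = 0$. Differentiating again,

$$\frac{\partial\beta(\theta)}{\partial\theta} = -\left(A\,E\left[\frac{\partial m(Z;\beta)}{\partial\beta} \,\Big|\, D=1\right]\right)^{-1} \times E\left[A\,m(Z;\beta)\,\frac{\partial\log f_\theta(Y,X \mid D=1)}{\partial\theta'} \,\Big|\, D=1\right].$$

Therefore, any regular estimator for $\beta$ is asymptotically linear with influence
function of the form

$$-\left(A\,E\left[\frac{\partial m(Z;\beta)}{\partial\beta} \,\Big|\, D=1\right]\right)^{-1} A\,m(z;\beta).$$

For a given matrix $A$, the projection of the above influence function onto
the tangent set follows from the previous calculations, and is given by
$-[A\mathcal{J}^1_\beta]^{-1}A\,F^1_\beta(y,x,d)$. The asymptotic variance corresponding to this
efficient influence function for fixed $A$ is therefore

(29)   $[A\mathcal{J}^1_\beta]^{-1}A\,\Omega\,A'[\mathcal{J}^{1\prime}_\beta A']^{-1},$

where $\Omega = E[F^1_\beta(Y,X,D)F^1_\beta(Y,X,D)']$ as calculated above. Therefore, the
efficient influence function is obtained when $A$ minimizes (29). It is easy to
show that such matrix $A$ is equal to $\mathcal{J}^{1\prime}_\beta\Omega^{-1}$, so that the asymptotic variance
becomes $V = (\mathcal{J}^{1\prime}_\beta\Omega^{-1}\mathcal{J}^1_\beta)^{-1}$. In fact, a standard textbook calculation shows

$$[A\mathcal{J}^1_\beta]^{-1}A\,\Omega\,A'[\mathcal{J}^{1\prime}_\beta A']^{-1} - (\mathcal{J}^{1\prime}_\beta\Omega^{-1}\mathcal{J}^1_\beta)^{-1}$$
$$= [A\mathcal{J}^1_\beta]^{-1}A\,\Omega^{1/2}\left(I - \Omega^{-1/2}\mathcal{J}^1_\beta[\mathcal{J}^{1\prime}_\beta\Omega^{-1}\mathcal{J}^1_\beta]^{-1}\mathcal{J}^{1\prime}_\beta\Omega^{-1/2}\right)\Omega^{1/2}A'[\mathcal{J}^{1\prime}_\beta A']^{-1} \ge 0.$$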



□ 
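
The minimization of (29) over $A$ can be checked numerically. The sketch below is only an illustration under assumed inputs: the Jacobian `J` and the matrix `Omega` are random draws rather than quantities from the paper, and the function name `sandwich` is hypothetical. It confirms that $A = J'\Omega^{-1}$ attains $(J'\Omega^{-1}J)^{-1}$ and that any other full-rank $A$ yields a variance that is larger in the positive semidefinite sense.

```python
import numpy as np

def sandwich(A, J, Omega):
    """Variance (A J)^{-1} A Omega A' (J' A')^{-1} for a d_beta x d_m matrix A."""
    AJ_inv = np.linalg.inv(A @ J)
    return AJ_inv @ A @ Omega @ A.T @ AJ_inv.T

rng = np.random.default_rng(0)
d_m, d_beta = 5, 2                                 # over-identified: d_m > d_beta
J = rng.standard_normal((d_m, d_beta))             # stand-in for the Jacobian
R = rng.standard_normal((d_m, d_m))
Omega = R @ R.T + d_m * np.eye(d_m)                # positive definite stand-in

V_opt = np.linalg.inv(J.T @ np.linalg.solve(Omega, J))   # (J' Omega^{-1} J)^{-1}
A_opt = J.T @ np.linalg.inv(Omega)                       # optimal combination matrix

gaps = []
for _ in range(200):
    A = rng.standard_normal((d_beta, d_m))         # arbitrary competing choice of A
    diff = sandwich(A, J, Omega) - V_opt
    eig = np.linalg.eigvalsh((diff + diff.T) / 2.0)
    gaps.append(eig.min() / max(1.0, eig.max()))   # scaled smallest eigenvalue
min_gap = min(gaps)

print("optimal A attains the bound:", np.allclose(sandwich(A_opt, J, Omega), V_opt))
print("smallest scaled eigenvalue of the variance gap:", min_gap)
```

The smallest eigenvalue is divided by the largest one only to keep the check stable when a random $A$ makes $AJ$ nearly singular.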



Proof of Theorem 2. As for Theorem 1, it suffices to present the
proof for the case of exact identification, since the overidentified case follows
from choosing the optimal linear combination matrix. If the propensity score
$p(x)$ is known, the score becomes [cf. Hahn (1998)] $S_\theta(d,y,x) = (1-d)s_\theta(y \mid x) + t_\theta(x)$,
so that the tangent space becomes $\mathcal{T} = \{(1-d)s_\theta(y \mid x) + t_\theta(x)\}$,
where $\int s_\theta(y \mid x)f_\theta(y \mid x)\,dy = 0$, and $\int t_\theta(x)f_\theta(x)\,dx = 0$. Consider case (1)
first. The pathwise derivative becomes

$$-(\mathcal{J}^1_\beta)^{-1}\left\{E\left[\frac{p(X)}{p}\,m(Z;\beta)\,s_\theta(Y \mid X)'\right] + E\left[\frac{p(X)}{p}\,\mathcal{E}(X)\,t_\theta(X)'\right]\right\}.$$

Pathwise differentiability is established by verifying that equation (21) holds,
with

(30)   $$F_\beta(y,x,d) = \frac{1-d}{p}\,\frac{p(x)}{1-p(x)}\,(m(z;\beta)-\mathcal{E}(x)) + \frac{\mathcal{E}(x)}{p}\,p(x).$$

Then the efficient influence function is as before equal to $-(\mathcal{J}^1_\beta)^{-1}F_\beta(y,x,d)$,
and the result of Theorem 2 follows using Theorem 3.1 of Newey (1990).

Since $p(x)$ does not enter the definition of $\beta$ in case (2), there is no change
to the efficient influence function and to the semiparametric efficiency bound
for that case. □



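Proof of Theorem 3. When $p(X)$ belongs to a correctly specified
parametric family $p(X;\gamma)$, the score function for moment (1) becomes

$$S_\theta(d,y,x) = (1-d)\,s_\theta(y \mid x) + \frac{d-p_\theta(x)}{p_\theta(x)(1-p_\theta(x))}\,\frac{\partial p(x;\gamma)}{\partial\gamma'}\,\frac{\partial\gamma}{\partial\theta} + t_\theta(x).$$

The tangent space is therefore $\mathcal{T} = \{(1-d)s_\theta(y \mid x) + c'S_\gamma(d;x) + t_\theta(x)\}$,
where $c$ is a finite vector of constants and $S_\gamma(d;x)$ is the parametric score
function. Now define $F_\beta(Y,X,D)$ as

$$\frac{1-D}{p}\,\frac{p(X)}{1-p(X)}\,[m(Z;\beta)-\mathcal{E}(X)] + \mathrm{Proj}\left[\mathcal{E}(X)\,\frac{D-p(X)}{p} \,\Big|\, S_\gamma(D,X)\right] + \frac{\mathcal{E}(X)\,p(X)}{p}.$$

It is clear that $F_\beta(Y,X,D)$ lies in the tangent space. Also note that
$\partial\beta(\theta)/\partial\theta$ can be written as

$$-(\mathcal{J}^1_\beta)^{-1}\left\{E[m(Z;\beta)s_\theta(Y \mid X)' \mid D=1] + E\left[m(Z;\beta)\left[t_\theta(X)' + S_\gamma(D;X)'\frac{\partial\gamma}{\partial\theta}\right] \,\Big|\, D=1\right]\right\}.$$

The second term in curly brackets can also be written as

$$E\left[\frac{(D-p(X))\,\mathcal{E}(X)\,S_\gamma(D;X)'}{p}\right]\frac{\partial\gamma}{\partial\theta} + E\left[\frac{p(X)\,\mathcal{E}(X)\,t_\theta(X)'}{p}\right].$$

With these calculations it can be verified that

$$\frac{\partial\beta(\theta)}{\partial\theta} = -(\mathcal{J}^1_\beta)^{-1}E[F_\beta(Y,X,D)S_\theta(Y,X,D)'].$$

In particular,

$$E\left[\frac{(D-p(X))\,\mathcal{E}(X)\,S_\gamma(D;X)'}{p}\right]\frac{\partial\gamma}{\partial\theta} = E\left[\mathrm{Proj}\left[\mathcal{E}(X)\,\frac{D-p(X)}{p} \,\Big|\, S_\gamma(D,X)\right]S_\theta(Y,X,D)'\right].$$

Therefore $-(\mathcal{J}^1_\beta)^{-1}F_\beta(Y,X,D)$ is the desired efficient influence function and
its variance is given as the efficient variance of Theorem 3. □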

APPENDIX B: PROOFS OF ASYMPTOTIC PROPERTIES 

In this Appendix we establish the large sample properties for the IPW-GMM
estimator with nonparametrically estimated propensity score function.
Again to stress the fact that the true propensity score is unknown, in
this Appendix we denote the true propensity score by $p_0(x) = E[D \mid X = x]$
and any candidate function by $p(x)$.

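Denote

$$L_2(\mathcal{X}) = \left\{h:\mathcal{X}\to\mathcal{R} : \|h\|_2 = \sqrt{\int h(x)^2 f_X(x)\,dx} < \infty\right\}$$

and

$$L_{2,a}(\mathcal{X}) = \left\{h:\mathcal{X}\to\mathcal{R} : \|h\|_{2,a} = \sqrt{\int h(x)^2 f_{X_a}(x)\,dx} < \infty\right\}$$

as the two Hilbert spaces. We use $\|h\|_2 \asymp \|h\|_{2,a}$ to mean that there are two
positive constants $c_1, c_2$ such that $c_1\|h\|_2 \le \|h\|_{2,a} \le c_2\|h\|_2$, which is true
under the assumption $0 < \underline{p} \le p_0(x) \le \bar{p} < 1$.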

Proposition B.l provides large sample properties for the sieve LS estima- 
tor p{x) of Po{x). 



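Proposition B.1. Under Assumptions 3.1, 3.2 and 3.5, and $k_n/n \to 0$,
$k_n \to \infty$, we have (i)

$$\|\hat p(\cdot) - p_0(\cdot)\|_{\infty,\omega} = o_p(1); \qquad \|\hat p(\cdot) - p_0(\cdot)\|_{2,a} \asymp \|\hat p(\cdot) - p_0(\cdot)\|_2 = o_p(1);$$

(ii) in addition, if Assumption 4.1 holds, then

$$\|\hat p(\cdot) - p_0(\cdot)\|_{2,a} \asymp \|\hat p(\cdot) - p_0(\cdot)\|_2 = O_p\left(\sqrt{\frac{k_n}{n}} + (k_n)^{-\gamma/d_x}\right).$$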



Proof, (i) Recall that p{x) is the sieve LS estimator of Po(-) £ 
based on the entire sample. That is, 

p{-) = argmin - - p{Xi)}^ /2, 
where Tin increases with sample size n, and is dense in A2{X) as kn
oo (by Assumption 3.5). Moreover, by Assumptions 3.1 and 3.2 we have
the following results: (1) the parameter space is compact under the norm
II • ||oo,a; for a; > 0, see Ai and Chen (2003); (2) E[{Di -p{Xi)y/2] is uniquely 
maximized at Poix) = E[D\X = x] GTi; (3) E[{Di —p{Xi)}'^/2] is continuous 
in p{-) under the metric || • ||oo,a;; and (4) 



sup 

P(-)6W 



1 " 



- J2{Di - piX,)}^/2 - E{Di - p{X,)}^/2 



n " , 



Op(l); 



where both results (3) and (4) are due to the fact that for any p{-), p{-) G Ti., 
\{Di-p{X,)}^-{Di-p{Xi)}^\ 

= |{2A - [piX.) +p(X,)]}b(X,) -p{X^)]\ 

< const.\[piX^) -KX0](1 + X,'Xi)--/2| X (1 +X;X,)-/2. 

Now E[{1 + XlXi^/^] < oo by Assumption 3.2. 

Hence by either Theorem in Gallant and Nychka (1987) or Lemma 2.9 
and Theorem 2.1 in Newey (1994), \\p{-) - po{-)\\oo,uj = Op{l). Now 



Pi-) - Poi-)\\2 = \ [p{x) -Po{x)Yfx{x)dx 



< y (IIp(-) - Po(-)lloo,.)' y (1 + x'xYfx{x) dx = Op(l) 

(by Assumption 3.2). 

(ii) We can obtain the convergence rate of ||p(-) — po(')ll2 by applying The- 
orem 1 in Chen and Shen (1998) or Theorem 2 in Shen and Wong (1994). 
Let Ln{p{-)) = ^j:tiKDi,Xi,p{-)) with i{Di,Xi,p{-)) = -{Di-p{Xi)}y2. 
Since all the assumptions of Chen and Shen (1998) Theorem 1 are satisfied 
given our Assumptions 3.1 and 3.2. We obtain 



■Po[-)\\2 



Op ^maxj \\po - Il2nPo\ 



Under Assumption 4.1, for po G A2(A'), there exists HoonPo G K^i'^) such 
that for any fixed w > 7, 

IIPo - noonPo||oo,c^ = SUp\[po{x) - UoonPo{x)]{l + |xp)"'^/^| 

X 

<const.{kn)-^^'^-, 
see Ai and Chen (2003). Hence by Assumption 4.1 with $\omega = \gamma + \epsilon$ for a small
$\epsilon > 0$,

\\Po - n2nPo||2 < IIPo - IloonPolb 



[Vo{x) - TloonPo{x)Yfx{x) dx 



< y(||Po(-) -noonPo(-)lloo,c^) y (1 + X'x)'^/X(x)dx 

<c'(fc„)-^/'^-. 

Then ||p(.)-p„(.)||2 = Op(y^+(A:„)-7/<i.) = Op(l). □ 

Proof of Theorem 6. We only provide the proof of the IPW-GMM
estimator for moment condition (1), since the one for moment condition
(2) is very similar. We establish this theorem by applying Theorem 1 in
Chen, Linton and van Keilegom (2003) (hereafter CLK) with their $\theta$ being
our $\beta$ and their $h$ being our $p(\cdot)$. Define



1 ' °. 

M.(Ap(.)) = -E-(Z./3)^-^; 



M{(3,pi-)) = Ea 



m{Zi,f3)- 



P{Xi. 
p{X^ 



E 



m{Z,(5) 



P{X) 



l-p{X) 



D = 



l-p{Xi 

CLK's conditions (1.1) and (1.2) are directly implied by our Assumptions 
1.1, 2 and moment condition (1). Note that for any p(-) G "H, < < 

< 1^ < 00, we have 

\M{P,p{-))-M{(3,poi-))\ 

P{X) 



E 



m{Z,f3) 



PojX) 

l-piX) l-poiX) 
<J^^Ea[\\m{Z,mil + \X\'r^'] 
X sup|[p(x) -po{x)]{l + |xp)"'^/2| 



D = 



< 



1 



sup \\m{Z, 



xi?,[(l + |X|2)-]| 



1/2 



(l-p)2 
X IIp(-) -Po(-)lloo,^, 

where the last inequality is due to our Assumptions 3.1, 3.2 and 3.4, hence 
CLK's condition (1.3) is satisfied with respect to the norm || • ||-^ = || • ||oo,lj- 
$\|\hat p(\cdot) - p_0(\cdot)\|_{\infty,\omega} = o_p(1)$ is implied by Proposition B.1(i). Note that



Ea 



sup 

ii,s-^ii<5,iip{.)-?{-: 



.<<5 



l-p{Xi) 



l-p{Xi 



sup ||m(Zj,/3) — m(Zj,/3)|| X sup 

i|/3-/3|l<<5 P(-)eH 



+ Ea 
<Ea 
+ Ea 



sup \\m{Zi, 



X sup 

||p(-)-?(-)l|oo,c.<5 



l-p{Xi 
p{Xi) 



p{Xi 



sup \\m{Zi^ P) — m{Zi 
ll/3-^li«5 



sup||m(Z,,/3)||(l + |X,|2)-/2 



P 



1 — p 



2\-uj/2\ 



^^P\\pi-)-p(-)\\^^<6^^Px&xMx)-p{x)]{l + \x\ ) 

< const.b{S) + const.S, 

where the last inequality is due to our Assumptions 3.1-3.4 and Proposition
B.1(i). Then CLK's condition (1.5) is satisfied, hence $\hat\beta - \beta_0 = o_p(1)$. □

Lemma B.2. Under Assumptions 1, 2, 3 and 4, we have 

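Proof. To establish this result, we follow the approach in Shen (1997)
and Chen and Shen (1998). Recall $p_0(x) = E[D \mid X = x] \in \Lambda^\gamma_c(\mathcal{X})$ and

$$\hat p(\cdot) = \arg\min_{p(\cdot)\in\mathcal{H}_n} \frac{1}{n}\sum_{i=1}^{n}\{D_i - p(X_i)\}^2/2.$$

Define the inner product associated with the space $L_2(\mathcal{X})$ as

$$\langle h, g\rangle = E\{h(X)g(X)\}, \quad\text{hence}\quad \|h(\cdot)\|_2^2 = \langle h, h\rangle = E[\{h(X)\}^2].$$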

Then the Riesz representor v* for functional E{£[X, Po) '^^ilp'^(^x^^ i ^® sim- 
ply given by 



v*{X) 



£{X,(5o) 

i-Po{xy 
this is because

* l|2 



V 



sup 



[E{£{x, (3o)mx) - po{x))/{i - poixmr^ 



EMx)-po{xW] 



■ E 



six,po) 

l-Po{X) 



2i 



and 



= E{v*{XMX)-po{X)]}. 

Let LM) = \ Y.':=i^{Di,XiM) with l{D,,Xup{-)) = -{A - 
p{Xi)Y/2. Let Ui = Di - Po{Xi). Then by definition E[Ui\Xi] = 0, and 
i{Di,Xi,p{-)) = -{Ui - [p{Xi) - po{Xi)\Y/2. We denote ^in{9) = ^ x 
Y^^=i[g{Di., Xi) — E{g{Di, Xi))] as the empirical process indexed by g, and 
En be any positive sequence with e„ = o(-^). Then by definition, 

< Lnip) - L„(p±e„n2„u*) 

= fin{i{Di,Xi,p) - e{D^,Xi,p±enU2nV*)) 

+ E{£{Di,Xi,p)-£{D,,Xup±enn2nV*))- 
A simple calculation yields 

E{£{D,,Xi,p)-i{Di,Xi,p± e„n2„t;* ) ) 

= ±enE[U2nV*{Xi){p{Xi) -po{X^)}] 

+ lelE[{U2nV*{X,)}\ 

t,niiiDi,Xi,p) - l{Di,Xi,p±£nU2nV*)) 
= =Fe„ X fln(n2nV*Ui) 

.2{p{-)-po{-)}±enIl2nV* 



hence 



± e„ X /i„ \^2nV 

< TM^2nV*{X,i)U)±E[U2nV*{X,){p{X^)-po{X,i)}] 

n 

±lJin{Ii2nV*{X,){p{Xi) -po{Xi)}) + ^Y^i^^^V* {Xi)Y 



i=l 



T/in([n2„t;* - V*]Ui) ± Hn{v*Ui) 

± E[[U2nV* - v*]{p - po}] T E[v*{p-po}] 

n 

±fIn(Jl2nV*iXi){p{X,) -p,{X,)}) + ^Y.{^^nV*iXi)}^ 



i=l 
In the following we shall establish (B2.1)-(B2.4):



(B2.2) E{[U2nV*{Xi)-v*{Xi)]{p{Xi) - Po{Xi)}) 



(B2.1) 



(B2.3) 



^in{Il2nV*{X,){p{Xi) -po{Xi)}) 



^ln{[Il2nV*{Xi)-V*{Xi)]U^) 




1 



n 



(B2.4) 



n 



Y^{U2nV*{X,)}' 



1=1 



Note that (B2.1) is implied by Chebychev inequality, i.i.d. data, and ||n„t;* — 
"1^*112 = o(l) which is satisfied given the expression for v* and Assumptions 
3.1 and 4.5. (B2.2) is implied by Assumption 4.5 and \\p{-) — Po{-)\\2 = 
Op{n~'^/^'^"'~^'^'^^) from Proposition B.l(ii). (B2.4) is implied by Markov in- 
equality, i.i.d. data, and Assumptions 3.1 and 4.5. Finally for (B2.3), let J^n = 
{U2nV*{-)h{-):h{-) £ A2{X)},thenhy Assumption 4 . 1 , log iV[.] ( 5, :r„ , 1 1 • 1 1 2 ) < 
const. ( I )'^''/'^ for any 6 > 0. Applying Theorem 3 in Chen and Shen (1998) 
with their 6n = n~'^/(^'''+''^\ we have 



Hence we obtain (B2.3). Now (B2.1)-(B2.4) imply < ±fin{v*Ui)^E[v*{p- 
Po}]+Op{j^), that is V^E[v*{X){p{X)-po{X)}] = ^j:^^,v*{Xi)U^ + 
Op(l), hence the result follows. □ 

Proof of Theorem 7. Again we only provide the proof of the IPW- 
GMM estimator for moment condition (1). We establish this theorem by 
applying Theorem 2 in CLK (2003). Given the definition of (5q and Theorem 
6, CLK's condition (2.1) is directly satisfied. Note that their Ti{(3,po) = 
J^, hence their condition (2.2) is satisfied with our Assumption 1.1. 

Following the proof of CLK's Theorem 2, we note that the conclusion of 
CLK's Theorem 2 remains true when CLK's conditions (2.3) (i) and (2.4) 
are replaced by the following one: 



sup \VnfJ.n(n2nV* {h{-) - Po{-)})\ 

he:Fn.\\h{-)^Po{-)\\2<Sn 



= Op(n"(27-d.)/(2{27+d.)))^0^(l)_ 



sup 

/3eS||/3-/3o||<5o 



\\M{P,p{-))-M{l3,po{-)) 



(*) 




r2{/3,Pom-)-Po{-)]\\=Op{7l 



EFFICIENT GMM WITH AUXILIARY DATA 






X. CHEN, H. HONG AND A. TAROZZI 



+ Ea 



sup \\m{Zi, 

/3GB: ||/3-/3o||<(5o 



X sup 

||p(-)-?(-)l|oo,c.<5 



l-p{Xi) l-p{Xi 



2n 



+ Ea 



sup \\m{Zi,f3) -m{Yi,Xi,'^''"'^ 
\\l3-m<S 



P 



1 — p 



sup \\m{Z^,P)\\\l + \Xi 
'-/36B||/3-/3o||<<5o 



l2\uj 



X sup sup| — + |j; 

lb(-)-?(-)lloo,c.<5^eA' 

< const. 5'^'^ + const. d'^ for some e G (0, 1], 



2\-ci;/2|2_ 



(1-^)2 



where the last inequality is due to our Assumptions 4.2, 4.3 and Proposition
B.1(i). In the following we let $N(\epsilon, \Lambda^\gamma_c(\mathcal{X}), \|\cdot\|_{\infty,\omega})$ denote the $\|\cdot\|_{\infty,\omega}$-covering
number of $\Lambda^\gamma_c(\mathcal{X})$ [i.e., the minimal number $N$ for which there
exist $\epsilon$-balls $\{h: \|h - u_j\|_{\infty,\omega} \le \epsilon\}$, $j = 1,\ldots,N$, to cover $\Lambda^\gamma_c(\mathcal{X})$]. Then our
Assumption 4.1 implies
oo,uji < const. \ - I 



logN{6,A2{X), 
V'logA(<^,A?(;f),|| 

■ l|co,a;) d6 K OO 



Thus by applying CLK's Theorem 3, CLK's condition (2.5) is satisfied. 
It remains to verify CLK's condition (2.6). First we note 



^MniPo,Po) 



1 



E/ ry n \ Po{Xi) 



Next we notice 



/n^T2{(3o,Po)\p{-) -Po{-)] 
V^{Mn{(3o,Po) +r2{(3o,Po)[p{-) - Po{-)]}
1



1 — p 
1 



+ Op(l), 

thus CLK's condition (2.6) is satisfied. Moreover from the proof of CLK's 
Theorem 2 we obtain 

V^0-Po) = -{r[wriy'r[w^{Mn{i3o,Po) 
+ r2{po,Pom-)-Po{-)]} + op{i) 

^ ~ ^-{j}'Wj})~^j}'W^{Mn{Po,Po) 



p 



Since ^ = T^ + o,(l), 



+ T2{(3o,Po)[p{-)-Po{-)]}+Op{l). 
{4'Wj})-'j}'WV^{Mn{Po,Po) 



+ T2iPo,Pom-)-Poi-)]} + Opil) 

-(j}'w4r'4'w- ' 



py/n 

+ Op(l), 

thus we obtain Theorem 7 after we establish condition (*).

We now apply Assumption 4.6a or 4.6b or 4.6c to verify condition (*). 
Since 

M{p,p{-)) - M{p,po{-)) - r2{P,PoM-)-Po{-)] 

p{X) po{X) p{X)-po{X) 



Ea\m{Z,p) 



1-piX) 1-poiX) (l-poiX))^ 



m{Z,(3)\p{X)-po{X)] 
l-Po{X) 



1 



1 



l-p[X) l-po{X) 
\ £{X,P)\p{X)-po{X)f
\{l-p{X)){l-po{X)f 



we have under Assumption 3.1, 

sup ||M(/3,p(.)) - M{P,po{-)) - r2(APo)b(-) -Po{- 

PeB: ||/3-/3o||<5o 



sup 

/3eB: ||/3-/3o||<<5o 



E. 



, f £{X,PMX)-po{X)]'' 
n(l-p(X))(l-p,(X))2 



< 



1 



:En 



(l-p)3 

If Assumption 4.6a holds, then 

Ea 



sup \\£{X,P)\\x{piX)-po{X)y 

/3eB:||/3-/3o||<5o 



sup \\£{X,P)\\x\p{X)-po{X)Y 

/3GB:||/3-/3o||<(5o 

< sup sup\\£{x,P)\\xEaMX)-po{X)f} 

/3GB:|j/3-/3oll<<5o ^ 

<const.[\\p{-)-po{-)h,a?. 

Now Proposition B.1(ii), $k_n = O(n^{d_x/(2\gamma+d_x)})$ and $\gamma > d_x/2$ imply
$[\|\hat p(\cdot) - p_0(\cdot)\|_{2,a}]^2 = o_p(n^{-1/2})$, hence condition (*) is satisfied.
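
The arithmetic behind the smoothness requirement $\gamma > d_x/2$ can be made explicit. The sketch below is an illustration only and its function names are hypothetical: with $k_n$ of order $n^{d_x/(2\gamma+d_x)}$, Proposition B.1(ii) gives a squared rate of order $n^{-2\gamma/(2\gamma+d_x)}$, and the code checks when this exponent exceeds $1/2$.

```python
from fractions import Fraction

def squared_rate_exponent(gamma, d_x):
    """Exponent e such that the squared L2 rate is n^{-e} when k_n ~ n^{d_x/(2*gamma+d_x)}."""
    gamma, d_x = Fraction(gamma), Fraction(d_x)
    return 2 * gamma / (2 * gamma + d_x)

def faster_than_root_n(gamma, d_x):
    """True when the squared rate vanishes faster than n^{-1/2}."""
    return squared_rate_exponent(gamma, d_x) > Fraction(1, 2)

for gamma, d_x in [(2, 3), (1, 2), (1, 4)]:
    print(gamma, d_x, squared_rate_exponent(gamma, d_x), faster_than_root_n(gamma, d_x))
```

Exact rational arithmetic is used so that the boundary case $\gamma = d_x/2$, where the exponent equals $1/2$ exactly, is classified without rounding error.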
If Assumption 4.6b holds, then 



Ea 



sup \\£{X,P)\\x\p{X)-po{X)Y 

/3GB: ||/3-A)||<'5o 



< E, 



sup 

/3GB:||/3-/3o||<5o 



1/4 



41^1/4 



x^ii;,{b(x)-p„(x)]2} 

< const. x[\\p{.)-po{-)h,a?"''^/^'^'^ 



for all 



Po(-)ll2,a = o(l), 



where the last inequality is due to the following inequalities for any s € 

{Ea{\p{X)-po{Xt}f' < COnst.i\\pi.)-poi-)h,a + II VHp(-) " Po(-)} II 2,a) , 
IIV^M-) -Po(-)}l|2,a < COnst.[\\p{.) 

Now Proposition B.l(ii), = 0(n'^-/(27+d^)) and 7 > 3^2; /4 imply [||p(-) - 
Po(-)ll2,a]^~"'^'^^^'^'' =Op{n~^/'^), hence condition (*) is satisfied. 
If Assumption 4.6c holds, then
eJ sup \\£{X,P)\\x[p{X)-po{X)A 

UeB:ll/3-/3nll<5n J 



<jEa 



sup \\£{X, 

./3eB: ||/3-/3o||<<5o 



X^EaMX)-po{X)]^} 



< const. x[\\p{.)-po{-)\\2,a?^'~''^/'-^^^^ 

for all ||p(.)-p„(.)||2,a = o(l). 

Now Proposition B.l(ii), A;„ = 0(n'^==/(^'^+'^^)) and j > dx imply [\\p{-) - 
Po(-)ll2,a]^(^~'^"/^^^)) = Opin~^/^), hence condition (*) is satisfied. □ 